Objectives

This report critically analyses the trends in enrolments for senior secondary (Year 11 and Year 12) mathematics and science subjects in Queensland. Initially, the dataset will be described, from how the dataset is obtained to how the raw data was restructured, cleaned and augmented to facilitate the analysis. This includes imputing missing values, using regular expressions to manipulate character strings and relational joins to combine datasets with matching observations. Next, exploratory data analysis will be conducted using modern data exploration tools to uncover the underlying structures and interesting features of enrolment in all subjects; This will consist of graphical displays such as time plots to better understand the dataset.

The aforementioned measures taken above gives readers an intuition of the dataset, the report will subsequently focus on building a statistical model to predict enrolments for senior secondary mathematics and science subjects. As the dataset is longitudinal, that is, involving a collection of data (enrolments) from the same school across multiple time points, a multilevel model is deemed appropriate for this analysis. This section of the report will further elaborate the methodology in producing the most appropriate model and the predictions from the ‘best’ model.

The following research questions guided the report:

  1. What are the underlying trends in senior secondary mathematics and science subjects?
  2. Are there differences in enrolments in the new QCE system (after 2020) as compared to the old QCE system (before 2020)?

Data and Motivation

Extracting the dataset

Table 1: Description of the dataset used for the analysis
Variable Description
Enrolment information
completion_year Graduation cohort’s year of completion
year_11_enrolments Number of Year 11 students enrolled in the subject
year_12_enrolments Number of Year 12 students enrolled in the subject
Subject information
qcaa_subject_id QCAA subject ID
subject_name Subject name
subject Mathematics or Science subject
subject_type Type of subject (General, applied or short courses)
School information
qcaa_school_id QCAA school ID
school_name School Name
doe_centre_code Department of Education Centre Code
qcaa_district School district
school_postcode School postcode
sector School sector (Government, Independent, Catholic)

The dataset used for this analysis encompasses enrolment information for all participating schools in the QCE system, and can be extracted in the Queensland Curriculum and Assessment Authority (QCAA) website, where the enrolments data for the old QCE system (enrolments spanning from 1992 to 2019) can be found under the “Statistics before 2020” toggle (The State of Queensland (Queensland Curriculum & Assessment Authority) (2019b)) while the enrolments data for the new QCE system can be found under the “Statistics from 2020” toggle.

Each row (observation) corresponds to a school’s enrolment information for a subject in a given year, in addition to the school and subject details. The dataset can be broken down into the three main components, as seen above (Table 1). The data above has been cleaned and is ready for the format that allows for analysis, the further sections will describe the data wrangling process in details.

The new QCE system

Senior schooling is a pivotal moment for Queensland students as most students transition into tertiary education. The new QCE system was introduced from Year 11 students in 2019, where Overall Position (OP) scores will be replaced by the new Australian Tertiary Admission Rank (ATAR) system in assisting universities in selecting applications in their courses. For this reason, the Queensland Core Skills (QCS) test for year 12 students, which has been in place since 1992 is abolished (after Year 12 in 2019) for the new ATAR system. The ATAR is a inter subject percentile-rank (as opposed to a mark), where students are ranked relative to their age group; To be specific, ATAR is expressed on a 2,000 point scale from 99.95 down to 0.00 in steps of 0.05.

Queensland ATARs are based on a student’s:

  • Best five general subject results, or
  • Best results in four General subjects, plus one Applied subject, or
  • Best results in four General subjects, plus one VET qualification at Certificate III or above

In the ATAR calculation, only one applied subject can be included in the ATAR calculation. For instance, if a student enrols in Essential Mathematics (applied subject) and Science in Practice (applied subject), only the higher of both

A new suite of senior mathematics and science subjects

In light of the new system, a new suite of senior subjects are reviewed and redeveloped, along with new ones introduced. The new senior syllabuses include general (extension) subjects for students who wish to pursue tertiary pathways, vocational education and training, and work. These subjects will have four summative assessments — three internal and one external examination at the end of Year 12, which are set and marked by independent teacher assessors accredited by the Queensland Curriculum and Assessment Authority. On the other hand, applied (essential) subjects are mainly for students interested in pathways beyond school that primarily lead to vocational education and training or directly to work. These subjects tend to be more ‘real world’ focus and do not have external examinations but have a total of four summative internal assessments.

Notably, this report will not cover short courses, which are suited for students who wants to establish a basis for further education or work, typically leading to further vocational education and training.

Mathematics

Table 2: Senior Mathematics subjects in the new QCE system
Old subject name New subject name
Applied Mathematics
Essential Mathematics
General Mathematics
Mathematics A General Mathematics
Mathematics B Mathematical Methods
Mathematics C Specialist Mathematics

There are three main mathematics subjects in the new QCE system, which are broadly equivalent to the Mathematics A, B and C in the old QCE system (Table 2). Of the general mathematics subjects, General Mathematics is the least taxing, followed by Mathematical Methods and Specialist Mathematics. Furthermore, Essential Mathematics (applied subject) is introduced in the new system, where it focuses on developing and using mathematical knowledge to investigate real-world problems such as managing money, travel and data, and decision-making using mathematical concept (The State of Queensland (Queensland Curriculum & Assessment Authority) (2019a)). Each school may offer various mathematics programs — e.g. offer two of the three general mathematics subjects, or only offer Essential Mathematics (applied subjects).

Mathematical Methods is often a prerequisite for some university courses, implying that students completing the subject can streamline entry to specific courses in the university. Specialist Mathematics is designed to be taken in conjunction, or on completion of Mathematical methods, as it extends the key ideas studied in that subject. This path is usually recommended for students who wish to undertake further study in mathematics — such as engineering or science related courses or have strong aptitude in mathematics.

As an alternative, there is a safety net as students may opt for a less challenging mathematics subject such as Essential Mathematics subject (applied subject), as it would also contribute to the student’s ATAR score. This subject is usually undertaken by students who do not require mathematics for tertiary education at university of have not been able to attain a pass in Year 10 Core Mathematics. However, it should be noted that only general mathematics subjects or applied mathematics subjects can included in the ATAR, but not both. For instance, it is not possible to incorporate Mathematical Methods (general subject) and Essential Mathematics (applied subject) in a student’s ATAR.

Sciences

Table 3: Senior Science subjects in the new QCE system
Old subject name New subject name
Applied Sciences
Agricultural Practices Agricultural Practices
Aquatic Practices Aquatic Practices
Science in Practice Science in Practice
General Sciences
Agricultural Science Agricultural Science
Biology Biology
Chemistry Chemistry
Earth Science Earth & Environmental Science
Marine Science Marine Science
Physics Physics
Psychology

Unlike mathematics subjects, there are little changes in the subject names. As demonstrated in Table 3 There are two main changes in science subjects — ‘Psychology’ is introduced in the new QCE system and Earth Science is renamed to Earth and Environmental Science.

Data Cleaning

To get the dataset ready for the analysis, the dataset is transformed to reproduce that of the new QCE system.

Joining dataset from old QCE system and new QCE system

Changing all subject names to match new QCE system: To combine the datasets relating to the old and new QCE system, the subject names were first matched. As demonstrated in Tables 2 and 3, some subjects names were changed in the new QCE system. All subject names in the old system were changed to that of the new system to facilitate the relational join. For example, Mathematics A, B and C were modified to General Mathematics, Mathematical Methods and Specialist Mathematics respectively to match that of the new QCE system.

Changing unit 1 and 2, and unit 3 and 4 enrolments to year 11 and year 12 enrolments: Next, the new QCE system introduces an assessment program consisting of Units 1 and 2. In general, most students complete Units 1 and 2 in Year 11, and Units 3 and 4 are undertaken in year 12. Accordingly, units 1 and 2, and units 3 and 4 are renamed to year 11 enrolments and year 12 enrolments to facilitate the relational join of both datasets corresponding to the old and new QCE system. Once the school and subject information were aligned, both datasets were joined seamlessly.

Manipulating character strings to match school names

Table 4: Example: Incompatible School names fixed by removing parantheses “()” and dash “-”
School Name (Old QCE system) School Name (New QCE system) Modified School Name
Fixed by removing parantheses
St Catherine’s Catholic College The Whitsundays (Proserpine (Renwick Road)) St Catherine’s Catholic College The Whitsundays (Proserpine) St Catherine’s Catholic College The Whitsundays
Lutheran Ormeau Rivers District School (Lords) Lutheran Ormeau Rivers District School (Pimpama) Lutheran Ormeau Rivers District School
Faith Baptist Christian School (Gladstone) Faith Baptist Christian School (Burua) Faith Baptist Christian School
Fixed by removing dashes
Rivermount College - Yatala Rivermount College (Yatala) Rivermount College
Carinity Education - Southside Carinity Education - Southside Sunnybank Carinity Education

Regular expressions were utilised to remove school names that did not match in the datasets for the old and new QCE system. An extract of what has been done is shown above (Table 4). Fortunately, schools are uniquely identified with QCAA school ID, easing this process as incompatible school names can be easily identified. In most cases, removing parentheses “()” and dashes “-” relating to suburb solves the issues.

Table 5: Example: Manually fixing errors due to issues in School Nmaes
School Name (Old QCE system) School Name (New QCE system) Issue (Why school names does not match)
Clairvaux Mackillop College Clairvaux MacKillop College Capital K in ‘MacKillop’
St Aidan’s Anglican Girls’ School St Aidan’s Anglican Girls School Extra ’ in ‘Girls’
Cloncurry State School P-12 Cloncurry State School Extra ‘P-12’
Aboriginal & Islander Independent Community School Aboriginal and Islander Independent Community School ‘&’ used instead of ‘and’

However, there were cases in which school names have to be recoded manually as the school names were inconsistent. Some of the examples are demonstrated in Table 5.

Missing values in enrolments relate to schools offering subject for Year 11 or Year 12 only

At-a-glance plot of the missing values in the raw dataset, where the pink cells indicate a missing value. Most observations are present, except for three variables. The missing values in year 11 and 12 enrolments relates to schools which offered a subject for year 11 but not for year 12 and vice versa

Figure 1: At-a-glance plot of the missing values in the raw dataset, where the pink cells indicate a missing value. Most observations are present, except for three variables. The missing values in year 11 and 12 enrolments relates to schools which offered a subject for year 11 but not for year 12 and vice versa

Figure 1 illustrates that most of the observations were present, except for there variables relating to enrolments (year_11_enrolments and year_12_enrolments) and the Department of Education Centre Code (doe_centre_code).

Zero enrolments in Year 11

First, 67.13% of the missing values in year 11 enrolments relate to the 2019 graduating cohort that were part of the Queensland’s non-compulsory first Prep Year cohort in 2007. This cohort commenced in 2007 at the current time and will graduate from year 12 in 2019. Therefore, from 2020, every year will be a full cohort of students, implying an additional group of students each year and thus higher expected enrolment for the new QCE system. As this 2007 prep year cohort leaves the schooling system, the enrolments from 2020 will increase considerably as a full cohort will be realised, as will be seen in the later sections.

Zero enrolments in Year 12 Next, 71.48% of the zero enrolments in year 12 corresponds to the first year in which a given school introduces a subject. As year 12 enrolments requires students to complete year 11 syllabus, most students may not partake in the subject when the schools first introduce the subject. In further scrutiny, the other schools that have zero enrolments are usually smaller schools (in rural areas) or schools may only offer a subject for year 11 students but not year 12 students in certain years and vice versa. For example, Mossman State High School offered Specialist Mathematics subject in some years — Offered the subject to Year 11 students of 2020 and Year 12 students of 2021, but not for Year 11 students of 2021 and Year 12 students of 2022 (Mossman State High School (2019)).

Missing Department of Education Centre Code

Schools that were missing the Department of Education Centre Code refers to schools that existed before 2002. This alludes that these schools may not have been assigned any Department of Education Centre Code. Fortunately, the dataset provides qcaa_school_id which have no missing values and can uniquely identify each school, and thus, thee missing values in doe_centre_code may be disregarded.

Revise school postcodes

Errors in school postcodes usually arise form school postcodes codes being reassigned overtime, due to typographical errors, mainly for schools with two or more different locations. These issues were fixed manually by cross checking the correct addresses for the school provided by Google Maps.

Exploratory Data Analaysis

Total number of participating schools for each subject per year

Figure 2: Total number of participating schools for each subject per year

Figure 2 displays the total number of participating schools for a given year. Some subjects showed rather stable enrolment figures such as Biology and Chemistry, which showed a steady increase in participating schools over the years. Other subjects like Science in Practice and Earth and Environmental Science showed a rather erratic trend, where the first year of Earth and Environmental Science subject showed the largest number of participating schools while Science in Practice subject showed a one-off year (2016) in which the total schools was significantly greater than the other years.

Average Enrolments for each Subject over the years

Average enrolments for all schools for each year, enrolments for the new QCE system are coloured in a darker shade of grey

Figure 3: Average enrolments for all schools for each year, enrolments for the new QCE system are coloured in a darker shade of grey

As mentioned, since the 2007 prep year cohort left the schooling system at the end of 2019, enrolments will be replaced by a full cohort. It is not surprising therefore that all subjects saw a surge in enrolments from 2019 to 2020. Based on a study by Independent Schools Queensland (Independent Schools Queensland (ISQ) (2020)), this resulted in approximately 6.2% (22,050) increase in secondary student enrolments from 2019 and 2020 in independent sectors alone.

Subjects were introduced at different time frames. Aquatic practices and Marine Science were only introduced in 2014 and 2015 respectively. There was a halt in Agricultural Practices subject in year 2000, before being re-introduced again in 2015. Furthermore, the inception of the new QCE system was two new subjects, Psychology and Essential Mathematics.

Time plot: Enrolments across the years

Time plot of enrolments of all the subjects across the years

Figure 4: Time plot of enrolments of all the subjects across the years

Figure 4 displays a spaghetti plot of year 12 enrolments for each subject, with overall fit using the the mean of enrolments for all schools in a given year. Each line represents year 12 enrolments for a single school. For most subjects, some relatively large schools can be identified as their enrolments numbers were significantly higher than most other schools. The year 12 enrolments for these large schools were especially prominent in the later years, where the new QCE system was introduced.

Some subjects that were introduced in the 1990s such as Biology, Chemistry, Physics, General Mathematics and Mathematical Methods showed rather stable enrolment numbers as enrolment trend across schools appears to be proportional to one another. In other words, when there is a general increase (decrease) in enrolments, most schools will show an increase (decrease), as seen by the large cluster of observations. In contrary, Agricultural Earth and Environmental Science, although introduced in the 1990s, did not exhibit such patterns as the enrolment numbers were rather erratic. However, it can be observed that there were significantly less enrolments for these subjects, implying that not many schools offer this subject or the subject is not popular among students.

It is common for year 11 students to continue pursuing year 12 units

Year 12 enrolments agianst year 11 enrolments, with a 45° line to indicating a perfect linear relationship

Figure 5: Year 12 enrolments agianst year 11 enrolments, with a 45° line to indicating a perfect linear relationship

A scatterplot of year 11 and year 12 are plotted in Figure 5, with a 45° line to facilitate the interpretation of the scatter plot, where it indicates a one unit increase in year 11 enrolments is matched with a one-unit increase in year 12 enrolments (i.e. perfect linear relationship). As already mentioned, the 0 enrolments in year 11 (as seen on the figure) is attributed to the Queensland’s first Prep Year cohort in 2007 while the 0 enrolments in year 12 relates to the year where a school introduces the subject for the very first time (thus, there will be no year 12 students as most students have yet to complete year 11).

Most subject demonstrates a linear relationship (lie on or close to the red line) between year 11 and year 12 students. Intuitively, this is reasonable as a student is likely to continue to complete the year 12 syllabus (unit 3 and 4) after completing his year 11 syllabus (unit 1 and 2).

Observations below the red line may suggest that students have dropped the subject or have taken a gap year. Students are usually recommended to enrol in 6 subjects for senior school (Art of Smart (2021)). A student usually starts with 6 subjects in year 11, before dropping down to 5 subjects in Year 12. However, completing 6 subjects gives the student a “safety net,” in the instance where the student did not perform well for a particular subject (noting that ATAR takes the best of 5 subjects). Taking this into account, students appears to be dropping Psychology subject, as Figure 5 demonstrates that most year 12 enrolments were below the red 45° line.

In contrary, observations that lie above the red line may indicate that these students may have completed the year 11 (unit 1 and 2) in year 10, and re-enrolled in year 12. Furthermore, students may opt to study at a school of distance education to complete their year 11 (unit 1 and 2) syllabus because of various reasons such as the school not offering year 11 syllabus for a given year.

Some science subjects such as Earth and Environmental Science, Marine science, and Science in Practice displays heteroscedasticity, where the larger the cohort size, the higher the rate of students dropping the subject or having more enrolments in year 12. Interestingly, these science subjects have a relatively small number of enrolments. This pattern does not seem to manifest in mathematics subjects.

Larger schools are associated with lower enrolments in year 12

Parallel Coordinate plot comparing year 11 and year 12 enrolments for a given cohort in a school. The blue (grey) lines represents an increase (decrease) in year 12 enrolments. In general, larger schools appears to have smaller year 12 enrolments, as indicated by numerous grey lines when enrolment numbers are higher

Figure 6: Parallel Coordinate plot comparing year 11 and year 12 enrolments for a given cohort in a school. The blue (grey) lines represents an increase (decrease) in year 12 enrolments. In general, larger schools appears to have smaller year 12 enrolments, as indicated by numerous grey lines when enrolment numbers are higher

The parallel coordinates plot (Figure 6) allows comparison of the difference between year 11 and year 12 enrolments. The grey line indicates that year 12 enrolments is smaller than year 11 enrolments for a given cohort in a school and conversely, the blue lines represents an increase in enrolments from year 11 to year 12.

In general, larger schools (with relatively high enrolment numbers for a given subject) appears to have smaller year 12 enrolments as compared to year 11 enrolments. This pattern is especially noticeable in Psychology, where most schools with enrolments higher than 150 see a decrease in enrolment numbers from year 11 to year 12. This was not the case for some subjects such as General Mathematics, Chemistry, and Biology.

Distribution of enrolments for each sector

Comparing distirbution of subjects

Distributions of enrolments for each subject shown with a density plot, overlaid with a density plot

Figure 7: Distributions of enrolments for each subject shown with a density plot, overlaid with a density plot

The distribution of enrolments for each subject appears to be right skewed (Figure 7). This is possibly attributed to the size of schools, as some schools are significantly larger than others.

Comparing enrolments across sectors using boxplot

Box-plot of distribution of enrolments for each subject by sector

Figure 8: Box-plot of distribution of enrolments for each subject by sector

In most subjects, the distribution in enrolments across all sectors were rather similar (Figure 8). Subjects that were introduced early such as Biology, Chemistry, Specialist Mathematics, and Physics showed stable enrolment numbers, with little variation across the different sectors. As highlighted by the outliers in all subjects, there were schools that has significantly more enrolments, these observations may refer to the larger schools in Queensland. These outliers are considerably prominent, especially in Government schools.

Predicting enrolments

Why the Multilvel Model

At the outset, a standard linear regression model assumes that residuals are independently and identically distributed (\(i.i.d\)) – i.e. There will be no further correlations (dependence) between measures once predictors are added to the model. Substantively, this suggests that any higher level entities are identical, and can be completely ‘pooled’ into a single observation. With a longitudinal data, this is often an unreasonable assumption, as temporal data are often characterised by marked dependence over time and response for measurements across time are often related to one another. For instance, the division between the within (e.g. schools) and between (e.g. between clusters such as district) would be assumed to be equal such that a one-unit increase in enrolments for within effects would match a one unit increase in between-effects, which may be unlikely the case in this context.

Using the multilevel model allows for simultaneous modelling of both intra-school change (i.e. how school enrolments are changing overtime) and interindividual change (i.e. Differences in temporal change across schools). As the dataset is imbalanced due to schools introducing or ceasing a given subject at different times, the multilevel model is appropriate as it is able to utilise available data from incomplete observations without reducing sample size.

In general, multilevel models takes on the general form: Repeated measures (level 1) nested within individuals (level 2) and possibly individuals nested within some higher-level clusters (level 3). With respect to the data, level 1 relates to a single measure in time, level 2 relates to the individual schools offering the maths and science subjects and level 3 corresponds to the schools nested within districts or postcodes. As our data is nested within schools and possibly in higher-level clusters, the multilevel model is appropriate.

Limitations

The data is aggregated at the school-level, implying that more granular information such as the student-level is not available. With the new QCE system, students are allowed to choose from a range of courses, and not all will pursue a mathematics and science pathway. This serves as a disadvantage as information such as proportions cannot be computed. Ideally, the number of students pursuing some STEM programs may be extracted, such that the proportion of students who are enrolled in a particular STEM subject and is pursuing a STEM career pathway can be computed.

Some subjects (e.g. Essential Mathematics and Psychology) are newly offered in the new QCE system. This insinuates that there are only observations spanning from 2020 to 2022 for these subjects. Although the multilevel model is efficient in using the available data (even without complete observations), it has its own drawbacks, which will be discussed in the models of each subject.

Methodology (Building the multilevel model)

As aforementioned, the objective of this report is to predict enrolments in light of the new QCE system.

1. Exploring the data with basic linear model within schools

A basic linear model within schools provide an at-a-glance visualisation of the enrolment trends within each school. An advantage of this visualisation is that the linear model summarises enrolment trends with only two summary statistics, an intercept and a slope, which is useful for panel data as enrolments can be viewed overtime per school. The intercept conveys a school’s enrolments when the subject was first introduced to the school while the slope indicates the school’s yearly average increase in enrolments over the period in which the subject is or was running.

2. Getting the dataset ready for modelling

Each subject encounters different circumstances (e.g. new subjects introduced only in 2020), the following steps are the main steps taken to get the dataset for a single subject ready for modelling. However, it should be noted that in some subjects, more or less steps could be taken given the context.

Removing observations with zero enrolments

Year 11 enrolments, with a vertical line indicating year 2019, where enrolments in most schools dropped to zero due to the 2007 prep year cohort.

Figure 9: Year 11 enrolments, with a vertical line indicating year 2019, where enrolments in most schools dropped to zero due to the 2007 prep year cohort.

Subjects with 0 enrolments for a given cohort will be removed as these enrolment figures will artificially inflate/deflate enrolment numbers. As outlined, most of the zero enrolments in 2019 was due to the completion of the 2007 prep year cohort. These zero enrolments can be seen in Figure 9, where the vertical red line indicates the year 2019.

By the same token, most year 12 subjects with with zero enrolments relates to the first year schools introduced the subject. In this year, most students would be ineligible to enrol in year 12 units as most year 12 units requires the completion of year 11 units. Furthermore, the other zero enrolments were due to small schools who had little to no enrolments in a particular year.

Linearising response variable enrolments using log transformation

As seen in the density plots for each subject (Figure 7), enrolment numbers are right skewed due to the various school sizes in Queensland. This means that most of the variance captured in the dataset can be attributed to the various school sizes in Queensland. As such, a log transformation will be utilised to allow models to be estimated by the linear mixed models. A log transformation is appropriate in this context as enrolments are by nature, a positive number.

Rescale time variable (completion_year)

It is often helpful to rescale time variable so the first measurement occasion is the zero point. In this context, the completion_year for each subject will be rescaled to 0, where 0 corresponds to the first year in which the subject was introduced. For instance, Specialist Mathematics subject was introduced in 1992, and thus, will be the zero point (variable named year92). Rescaling time variable gives the intercept the interpretation of the baseline, or initial status of the dependent variable (i.e. 0 as the commencement of the subject)

3. Initial model – Unconditional means (null) model

Inherent nested structure

Nested structure of the dataset

Figure 10: Nested structure of the dataset

The multilevel model was primarily used because of the underlying nested structure in the dataset; A nested structure occurs when individual data points as one level (i.e. schools in this context) appear in only one level of a higher-level variable (e.g. school postcodes or QCAA district). With a longitudinal data, time predictors will be associated with a single measurement in time (i.e. first year in which a school introduces the subject).

Accordingly, three potential models can be identified from the nesting structure above (Figure 10):

  1. Two-level nested structure: Schools
  2. Three-level nested structure: Schools nested within postcodes
  3. Three-level nested structure: Schools nested within QCAA districts

Presently, information criteria selection such as AIC dominate the model selection criteria in comparing non-nested multilevel model. This report will follow suit, by comparing the different non-stationary random effects using AIC to select the ‘best’ initial model.

Model specificaiton (and notations)

In a multilevel context, it is often helpful to begin with an unconditional means (null) model where there are no predictors at any level, only an intercept. This model model provides useful information in understanding the structure of the data. It can be utilised to obtain estimates of residuals and intercept variance that allows for assessment of the variation at each level, and in comparing variability in (1) schools within clusters for a level three model or (2) within schools for a level three model. However, before delving into the specifics, all potential models have to be considered.

For instance, the intraclass correlation coefficient (ICC = \(\rho_i\)) can be computed. ICC refers to proportion of variance in the response (enrolments) that can be explained by the grouping structure of the hierarchical model; It is often easier to conceptualise it as the correlation between the enrolments of two schools, randomly selected from a cluster or as the same amount of variance explained by those groupings (similar to an \(R^2\)).

The three-level unconditional means model can be expressed as:

  • Level one (time point within schools) \[Y_{tij} = \pi_{0ij} + \epsilon_{tij}\]

  • Level two (schools within clusters) \[\pi_{0ij} = \beta_{00j} + u_{0ij}\]

  • Level three (clusters) \[\beta_{00j} = \gamma_{000} + r_{00j}\]

In this case, clusters (i.e. in this dataset, districts or postcodes) are considered to be independent, but schools within the same districts and the measurements at different time points for the same school are correlated. If a two-level multilevel model is used, level three can be disregarded.

At level one:

  • \(y_{tij}\): response variable (enrolments) of school \(i\) from cluster \(j\) at time \(t\)
  • \(\pi_{0ij}\): True/average mean of response (enrolments) of school \(i\) from cluster \(j\) across all time points
  • \(\epsilon_{tij}\): Difference between observed enrolments (\(y_{tij}\)) and mean enrolments for school \(i\) in district \(j\)

At level two:

  • \(\beta_{00j}\): True mean enrolments for cluster \(j\) across all schools from the given (\(j^{th}\)) district
  • \(u_{0ij}\): Difference between enrolments of school \(i\) from district \(j\) from the mean enrolments of all schools in district \(j\)

At level three:

  • \(\gamma_{000}\): Fixed effects model parameter representing the true mean enrolments across all schools, districts and time points
  • \(r_{00j}\): Describes the difference in enrolments between the mean enrolments from all observations in district \(j\) and the overall mean enrolments across all districts, schools and time points.

In combining level 1,2 and 3, the composite model can be obtained:

\[\begin{aligned} Y_{ij} & = \pi_{0ij} + \epsilon_{tij} \\ & = (\beta_{00j} + u_{0ij}) + \epsilon_{tij} \\ & = (\gamma_{000}) + (r_{00j} + u_{0ij} + \epsilon_{tij}) \end{aligned}\]

The model can be split into two parts. The ‘fixed effects’ will be represented in the first parentheses of the last expression, usually denoted by \(\gamma\) while the ‘random effects’ will be represented in the second parentheses, denoted by \(r\), \(u\) and \(\epsilon\) components.

where,

\[\begin{aligned} \epsilon_{tij} & \sim N(0, \sigma^2) \\ u_{0ij} & \sim N(0, \tau^2_{00}) \\ r_{00j} & \sim N(0, \phi^2_{00}) \end{aligned}\]

Where, - \(\epsilon_{tij}\): Variance over time within schools - \(\tau^2_{00}\): Variance between schools from the same cluster - \(\phi^2_{00}\): Variance across clusters

4. Unconditional growth model

The unconditional growth model incorporates time as a linear growth at level one (with no predictors at level two and three) to reduce unexplained variability within schools. This model allows for assessing the within-school variability that can be attributed to the systematic changes over time.

To reduce correlation between the linear components of the time effect, the time variable rescaled. For this report, time variable (completion_year) is rescaled to the zero point, that is, the first year in which the school introduced the subject. This gives the intercept the interpretation of baseline or initial status of the enrolments.

Model specification

As with the unconditional means model, the unconditional growth model can be expressed as:

  • Level one (time point within schools) \[Y_{tij} = \pi_{0ij} + \pi_{1ij}year_{tij} + \epsilon_{tij}\]

Where, year is the time variable.

  • Level two (schools within clusters) \[\begin{aligned} \pi_{0ij} &= \beta_{00j} + u_{0ij} \\ \pi_{1ij} &= \beta_{10j} + u_{1ij} \end{aligned}\]

  • Level three (clusters) \[\begin{aligned}\beta_{00j} &= \gamma_{000} + r_{00j} \\ \beta_{10j} &= \gamma_{100} + r_{10j} \end{aligned}\]

As compared to the null-model, the between-school and between-clusters variability are now partitioned into variability in initial status (\(\tau^2_{00}\) and \(\phi^2_{00j}\)) and variability in rates of change (\(\tau^2_{10}\) and \(\phi^2_{10}\)); i.e.

At level one:

  • The level one trajectory for school \(i\) in district \(j\) is assumed to be linear, with intercept \(\pi_{0ij}\) and slope (growth rate) \(\pi_{1ij}\)
  • \(\epsilon_{tij}\) captures the deviation between the true growth trajectory and the observed values (enrolments)

At level two:

  • \(\beta_{00}\) represents the true intercept and \(\beta_{10}\) represents the true mean slope for all school \(i\)
  • The random effects \(u_{0ij}\) and \(u_{1ij}\) captures the deviation between school \(i\)’s true growth trajectory and the mean intercept for and slope for school \(i\)
    • The deviations in intercept and slope at level two are allowed to be correlated through the covariance parameter \(\tau_{01}\)

At level three:

  • \(\gamma_{000}\) is the true mean intercept while \(\gamma_{100}\) is the true mean yearly growth rate for over all schools
  • \(r_{00j}\) and \(r_{10j}\) captures the deviation between school \(i\) true overall growth trajectory and population mean intercept and slope

In combining the levels, the full composite model may be obtained:

\[\begin{aligned} Y_{tij} &= \pi_{0ij} + \pi_{1ij}year_{tij} + \epsilon_{tij} \\ &= (\beta_{00j} + u_{0ij}) + (beta_{10j} + u_{1ij})year_{tij} + \epsilon_{tij} \\ &= (\gamma_{000} + r_{00j} + u_{0ij}) + (\gamma_{100} + r_{10j} + u_{1ij})year_{tij} + \epsilon_{tij} \\ &= [\gamma_{000} + \gamma_{100}year_{tij}] + [r_{00j} + u_{0ij} + \epsilon_{tij} + (r_{10j} + u_{1ij})year_{tij}] \end{aligned}\]

With, \(\epsilon_{ijk} \sim N(0, \sigma^2)\),

\[\left( \begin{matrix} u_{0j} \\ u_{1j} \end{matrix} \right) \sim N \left( \begin{matrix} 0 \\ 0 \end{matrix} \begin{matrix} , \end{matrix} \begin{matrix} \tau^2_{00} & \tau_{01} \\ \tau_{01} & \tau^{2}_{10} \end{matrix} \right)\]

\[\left( \begin{matrix} r_{00j} \\ r_{10j} \end{matrix} \right) \sim N \left( \begin{matrix} 0 \\ 0 \end{matrix} \begin{matrix} , \end{matrix} \begin{matrix} \phi^2_{00} & \\ \phi_{01} & \phi^{2}_{10} \end{matrix} \right)\]

5. Dealing with boundary issues, if any

The multilevel models may encounter boundary constraint, especially when random slopes are introduced at level two and three. This singular fit error occurs as correlation coefficients between two error terms are at the ‘boundary’ of the allowable values \([1,1]\). In this report, the error is an indication of overfitting of the model (Ben Bolker et al. (2021)) – That is, the random effects structure proves to be too complex to be supported by the data.

Naturally, this suggests that the model requires re-parameterisation by altering to feature different parameters. This report considers simplifying the model by removing level three error terms (\(r_{10j}\)); which effectively removes two parameters (1) Variance for \(r_{10j}\) and (2) Correlation between \(r_{01j}\) and \(r_{10j}\)

With respect to the unconditional growth model, the model is simplified by setting the error term \(r_{10}j = 0\) (and therefore \(r_{10} = 0\)) at level three:

\[\begin{aligned} \beta_{00j} = \gamma_{000} + r_{00j} \\ \beta_{10j} = \gamma_{100} \end{aligned}\]

This implies that the model’s error assumption at level three will turn into a univariate (as opposed to a multivariate) normal distribution. Furthermore, this will also imply that level-three (e.g.) clusters will be associated with the same growth rate; In other words, each cluster will have the same rate of change in enrolments for each year. In most cases, removing these parameters will have an insignificant impact on the fixed effects. Therefore, the benefit of this approach is that it will lead to a more parsimonious model that is not over-fitted and free from any boundary constraints.

6. Adding fixed fixed effects and comparing potential models with criterion-based approach

Adding sector and unit as fixed effects

In general, there are two types of predictors which may be added to the longitudinal model (1) Time varying (level 1): predictors that are associated with a specific measurement of time (2) Time invariant (level 2 or higher): Predictors associated with schools or clusters measured only at one point in time

At this point, the model only has a dedicated time variable that makes the multilevel longitudinal, with no other predictors added. With reference to Table 1, this report will consider school-specific and subject-specific variables as there are no cluster-specific (district or postcode information) variables. To be specific, two time-invariant predictors (sector and unit) will be added to the model as fixed effects. To be noted, these categorical variables will not be added as random effects as there is not enough random-effect levels (e.g. blocks) for these predictors. As a rule of thumb, there should be more than 5 or 6 levels at a minimum (Ben Bolker et al. (2021)). Furthermore, adding another level to the model may overcomplicate interpretations.

Forward Stepwise Approach to select the best model

In adding sector and unit into the model, there will be a handful of potential models (e.g. three-way interactions and two-way interactions, removing three-way interactions etc.). This report considers a forward stepwise approach, where the largest possible model (all possible combinations of fixed effects) will be fitted, before removing the fixed effects sequentially (one after another). The AIC will then be recorded for each model; Noting that models will be fitted with with maximum likelihood estimates rather than the default restricted maximum likelihood estimates (REML) as REML will generally lead to erroneous results as the fixed effects change (Faraway (2016)). The model with the lowest AIC will be deemed the “best” model, based on the fixed effects.

7. Test random effects using parametric bootstrap

The parametric bootstrap will be utilised to examine if simplified (smaller) version of model (with fewer random effects) is preferred. For a three-level multilevel model, this involves removing the random slope (\(r_{10j}\)) in the larger (proposed) model at level-three to create the smaller model. Removing \(r_{10j}\) is equivalent to setting \(\phi_{01} = 0\) and \(\phi_{10} = 0\). In other words, the parametric bootstrap tests if the variance of a random effect \(r_{10j}\), is in fact 0.

The parametric bootstrap better approximates the correct \(p\)-value for the corresponding likelihood ratio test by simulating under the null hypothesis, especially when the actual likelihood ratio test statistic is at the vicinity of the critical value of the (wrong) reference distribution. The standard \(\chi^2\) asymptotic are not used as it is proven to be produce a ‘too conservative’ test, where the \(p\)-values are usually too large and not rejected enough (Faraway (2016)).

Process

In conducting the parametric bootstrap involving variance terms at the boundary, the null and alternative hypothesis can be written as: \[\begin{aligned} H_0: \phi^2_{00} = \phi_{01} \ne 0 \\ H_A: \phi^2_{00} = \phi_{01} = 0 \end{aligned}\]

  1. Begin by fitting both (full and smaller) models to the actual data with maximum likelihood estimates to obtain the actual likelihood ratio test statistic (\(2 \times logL(\text{full model}) - logL(\text{reduced model})\)). This will be called the “actual (observed) likelihood ratio test statistic.”

  2. Using the smaller model, obtain estimated fixed effects & variance components (parametric).

  3. Use estimated fixed effects and variance components to generate pseudo-data from the null model – i.e. regenerate a new set of response (log(enrolments)) with its associated fixed effects for each observation (bootstrap).

  4. Fit both smaller and reduced model to the new pseudo-data.

  5. Compute likelihood ratio test statistic comparing both models.

  6. Repeat steps 2-4 for a large number of times (this report simulates \(B = 1,000\) bootstrap samples).

  7. The estimated \(p\)-value is computed by finding the proportion of likelihood ratio test statistic that exceeds the actual (observed) value in step (1). That is, the chance of observing the large (small) actual likelihood test statistic or check if the large (small) actual test statistic was a rare occurrence.

Note:: Code is provided by fabians ((https://stats.stackexchange.com/users/1979/fabians) (n.d.)); Some modifications were made to the original code to be aligned with the analysis.

In conducting the parametric bootstrap, a histogram of the likelihood ratio test statistic will then be implemented to visualise the results (can be compared with a \(\chi^2\)-distribution with 2 degrees of freedom). If the null hypothesis is rejected at the 5% level (i.e. p-value < 0.05), it indicates that the smaller model is not correct and the full model will be selected.

8. Parametric bootstrap to conduct Confidence interval

With our final model, confidence intervals can be constructed for the parameters; This can be done via the confint(method = "boot") function. However, it is always better to get a more intuitive understanding of what the function is doing under the hood.

  1. Simulate data from the proposed model

  2. Refit the model using the new responses from the simulated data and estimate the parameters

  3. Repeat steps 1 and 2 for a larger number of times, storing the results each time

  4. Use quantiles of bootstrapped estimates to compute the estimated intervals

  • Like any other confidence interval, the output can be interpreted as – if the procedure was repeated many times over, the true value will lie within the interval 95% of the time.

  • This confidence interval can be used to reinforce the idea that the larger model is better by checking if the variance parameters does not include 0, suggesting that the random effects are significant at the 5% level.

9. Interpreting the final model

Fixed effects

Fixed effects are functions of the covariates that accounts for the differences over time, ‘controlling out’ higher-level differences and processing the distinctive specific characteristics of higher-level units (Bell and Jones (2015)). Fixed effects introduces a separate parameter for each group – equivalent to introducing dummy variables for each higher-level entity (less a reference category) – where each group are independent and has its own individual characteristics. As will be demonstrated, it allows for analysing the trends in enrolments between the different sectors (Catholic, Independent and Government) and units (year 11 and year 12) and the coefficients may be interpreted like a standard regression model.

Random effects

With the higher-level variance being ‘controlled out’ by the fixed effects, the random effects framework allows for heterogeneity to not only be corrected, but explicitly modelled (Bell and Jones (2015)). The random effects expresses the variation between all schools, which assumes that the higher-level entities comes from random draws of a normal distribution – which the variance is estimated by the model – and can itself be interpretable and relevant. This implies that random effects induces a correlation between schools at the same level (e.g. schools in the same district).

Prediction for Mathematics Subjects

Mathematical Methods

Exploring the dataset with basic linear model

Basic linear model for 20 randomly selected schools to provide an at-a-glance visualisation of enrolment trends within schools for Mathematical Methods

Figure 11: Basic linear model for 20 randomly selected schools to provide an at-a-glance visualisation of enrolment trends within schools for Mathematical Methods

Figure 11 fits a linear model for 20 randomly selected schools. As enrolments are plotted on the same y-axis, the different sizes of school is apparent. For instance, school 174 has more than 100 enrolments for mathematical methods for each cohort, while school 496 and 518 have little enrolments (< 50) for each cohort. Each school showed rather distinct trend, where school 441 and 304 showed a decrease in enrolments while schools 28, 174, 417 showed an increase in enrolments, on average. It can also be observed that school 391 (St Mary’s College) only introduced the subject in 2020, as it was a relatively new school.

Getting the data ready for modelling

Removing zero enrolments

All zero enrolments in a given year will be removed for modelling. As aforementioned, most of the zero enrolments in year 11 (refer to Figure 9) were attributed to the 2007 prep year cohort while zero enrolments in year 12 relates to the first year in which a school introduces the subject. Other zero enrolments mostly relates to smaller schools with little to no enrolments in the subject for a given year. These zero enrolments will be removed for modelling purposes.

Linearise response variable using log transformation

Effects of log transformation for response variable (enrolments) in Mathematical Methods

Figure 12: Effects of log transformation for response variable (enrolments) in Mathematical Methods

As multilevel model assumes normality in the error terms, a log transformation is utilised to allow models to be estimated by the linear mixed models. The log transformation allows enrolment numbers to be approximately normally distributed (Figure 12).

Unconditional means model

Table 6: AIC values for all candidate models for Mathematical Methods
df AIC
Model0.2: Schools nested within districts 4 24303.39
Model0.1: Schools nested within postcodes 4 24333.96
Model0.0: Within schools 3 24346.27

As per step 3, the three potential models are fitted, with the AIC shown in Table 6. Based on the AIC, model0.2, corresponding the schools nested within districts is the best model and will be used in the subsequent analysis.

Intraclass correlation (\(ICC\))

summary(model0.2)
## Random effects:
##  Groups                       Name        Variance Std.Dev.
##  qcaa_school_id:qcaa_district (Intercept) 0.83408  0.91328 
##  qcaa_district                (Intercept) 0.14864  0.38554 
##  Residual                                 0.17998  0.42424
## 
##  Fixed effects:
##             Estimate Std. Error  t value
## (Intercept) 3.305797  0.1152798 28.67629
## 
##  Number of schools (level-two group) = 466 
##  Number of district (level-three group) = 13

This model will takes into account 466 schools nested in 13 districts. In a three-level multilevel model, two intraclass correlations can be obtained using the model summary output above:

The level-two ICC relates to the correlation between school \(i\) from district \(k\) in time \(t\) and in time \(t^* \ne t\):

\[ \text{Level-two ICC} = \frac{\tau^{2}_{00}}{\tau^{2}_{00} + \phi^{2}_{00} + \sigma^2} = \frac{0.83408}{(0.83408 + 0.14864 + 0.17998)} = 0.7174\]

This can be conceptualised as the correlation between enrolments of two random draws from the same school at two different years. In other words, 71.74% of the total variability is attributable to the changes overtime within schools.

The level-three ICC refers to the correlation between different schools \(i\) and \(i^*\) from a specific school \(j\).

\[ \text{Level-three ICC} = \frac{\phi^2_{00}}{\tau^2_{00} + \phi^2_{00} + \sigma^2} = \frac{0.14864}{(0.83408 + 0.14864 + 0.17998)} = 0.1278\]

Similarly, this can be conceptualised as the correlation between enrolments of two randomly selected schools from the same district – i.e. 12.78% of the total variability is due to the difference between districts.

Unconditional Growth model

The unconditional growth model introduces the time predictor at level one, the model specification can be found in step 4. This allows for assessing within-school variability which can be attributed to linear changes over time. Furthermore, variability in intercepts and slopes can be obtained to compare schools within the same districts, and schools from different districts.

summary(model1.0)
##  Groups                       Name        Variance   Std.Dev.  Corr  
##  qcaa_district:qcaa_school_id (Intercept) 2.0601e+00 1.4353101       
##                               year92      1.8682e-03 0.0432225 -0.753
##  qcaa_district                (Intercept) 6.9225e-02 0.2631056       
##                               year92      5.9806e-05 0.0077335 1.000 
##  Residual                                 1.2387e-01 0.3519528
##                Estimate  Std. Error   t value
## (Intercept) 3.066187193 0.100976610 30.365321
## year92      0.009110851 0.003051126  2.986062
##  Number of Level Two groups =  466 
##  Number of Level Three groups =  13
  • \(\pi_{0ij}\) = 2.7409: Initial status for school \(i\) in district \(j\) (i.e. expected log enrolments when time = 0)
  • \(\pi_{1ij}\) = 0.0465: Growth rate for school \(i\) in district \(j\)
  • \(\epsilon_{tij}\) = 0.162813: Variance in within-school residuals after accounting for linear growth overtime

When the subject was first introduced in 1992, schools were expected to have 15.5009 (\(e^{1.7848}\)) enrolments, on average. Enrolments were expected to increase by 4.7598% (\((e^{0.0465} - 1)\times100\)) per year. The estimated within-schools variance decreased by -9.5556% (0.1800 to 0.1628), implying that -9.5556% of within-school variability can be explained by the linear growth over time.

Dealing with boundary constraint

A singular fit is observed in the model as the correlation between the intercept and slope between districts are perfectly correlation (i.e. \(\phi_{01}\) = 1). This may suggest that the model is overfitted – i.e. the random effects structure is too complex to be supported by the data and may require some re-parameterisation. Naturally, the higher-order random effects (e.g. random slope of the third level (between district)) can be removed, especially where the variance and correlation terms are estimated on the boundaries (Roback and Legler (2021)).

summary(model1.1)
##  Groups                       Name        Variance  Std.Dev. Corr  
##  qcaa_district:qcaa_school_id (Intercept) 2.0930666 1.446743       
##                               year92      0.0019446 0.044097 -0.757
##  qcaa_district                (Intercept) 0.2039055 0.451559       
##  Residual                                 0.1238374 0.351905
##                Estimate Std. Error   t value
## (Intercept) 3.053472289 0.14377570 21.237749
## year92      0.009675574 0.00220878  4.380505
##  Number of Level Two groups =  466 
##  Number of Level Three groups =  13

To elaborate, two parameters were removed by setting variance components \(\phi^2_{10}\) = \(\phi_{01}\) equal to zero Which indirectly assumes that the growth rate for district \(j\) to be fixed. As shown in the output above, this produced a more stable model and is free from any boundary constraints. As compared to the unconditional growth model (model1.0), the fixed effects remained rather similar.

Level one and level two will be identical to the unconditional growth model (model1.0), however, the random slope for level 3 will be removed. This implies that the error assumption at level three now follows a univariate normal distribution where \(r_{00j} \sim N(0, \phi_{00}^{2})\).

The new level three (districts): \[\beta_{00j} = \gamma_{000} + r_{00j} \\ \beta_{10j} = \gamma_{100} \]

And therefore composite model: \[\begin{aligned} Y_{tij} &= \pi_{0ij} + \pi_{1ij}year92_{tij} + \epsilon_{tij} \\ &= (\beta_{00j} + u_{0ij}) + (beta_{10j} + u_{1ij})year92_{tij} + \epsilon_{tij} \\ &= (\gamma_{000} + r_{00j} + u_{0ij}) + (\gamma_{100} + u_{1ij})year92_{tij} + \epsilon_{tij} \\ &= \left[\gamma_{000} + \gamma_{100}year92_{tij} \right] + \left[r_{00j} + u_{0ij} + u_{1ij}year92_{tij} + \epsilon_{tij} \right] \end{aligned}\]

Testing fixed effects

Table 7: AIC for all possible models with different combinations of fixed effects
model AIC
model4.7 18176.17
model4.1 18177.29
model4.4 18177.87
model4.0 18178.63
model4.5 18178.71
model4.10 18238.84
model4.9 18238.84
model4.2 18238.85
model4.8 18238.85
model4.6 18239.98
model4.3 18241.58

As highlighted in step 6, sector and unit will be added as predictors to the model. The largest possible model will be fitted, before removing fixed effects one by one while recording the AIC for each model. In this case, model4.0 corresponds to the largest possible model while model4.10 is the smallest possible model. The model with the optimal (lowest) AIC is model4.7 (Table 7). The next section will test the selected model’s random effects to build the final model.

Parametric bootstrap to test random effects

This step will not be undertaken, as the random slope will not be included at level three as a boundary constraint was found in the unconditional growth model, indicating that the model will be overfitted if random slopes were included at level three.

Confidence interval

Table 8: 95% confidence intervals for fixed and random effects in the final model
var 2.5 % 97.5 %
sd_(Intercept)|qcaa_district:qcaa_school_id 1.2611487 1.4536475
cor_year92.(Intercept)|qcaa_district:qcaa_school_id -0.7727809 -0.6789689
sd_year92|qcaa_district:qcaa_school_id 0.0370458 0.0430934
sd_(Intercept)|qcaa_district 0.2229889 0.6290453
sigma 0.3444005 0.3511923
(Intercept) 3.0378500 3.8143678
year92 -0.0002020 0.0197675
sectorGovernment -0.3032420 0.4819675
sectorIndependent -1.6557356 -0.8014642
unityear_12_enrolments -0.1087469 -0.0620239
year92:sectorGovernment -0.0257979 -0.0028113
year92:sectorIndependent 0.0132636 0.0383081
sectorGovernment:unityear_12_enrolments -0.0550680 0.0005954
sectorIndependent:unityear_12_enrolments -0.0385138 0.0211624

The parametric bootstrap is utilised to construct confidence intervals (detailed explanation in step 8) for the random effects. If the confidence intervals between the random effects does not include 0, it provides statistical evidence that the p-value is less than 0.5. In other words, it suggests that the random effects and the correlation between the random effects are significant at the 5% level. The confidence interval for the random effects all exclude 0, indicating that they’re different from 0 in the population (i.e. statistically significant).

Interpreting final model

Composite model

  • Level one (measurement variable) \[Y_{tij} = \pi_{0ij} + \pi_{1ij}year92_{tij} + \epsilon_{tij}\]

  • Level two (schools within districts) will contain new predictor(sector) \[\begin{aligned} \pi_{0ij} &= \beta_{00j} + \beta_{01j}sector_{ij} + \beta_{02j}unit_{ij} + \beta_{03j}sector_{ij}unit_{ij} + u_{0ij} \\ \pi_{1ij} &= \beta_{10j} + \beta_{11j}sector_{ij} + u_{1ij} \end{aligned}\]

  • Level three (districts) \[\begin{aligned} \beta_{00j} &= \gamma_{000} + r_{00j} \\ \beta_{01j} &= \gamma_{010} + r_{01j} \\ \beta_{02j} &= \gamma_{020} + r_{02j} \\ \beta_{03j} &= \gamma_{030} + r_{03j} \\ \beta_{10j} &= \gamma_{100} \\ \beta_{11j} &= \gamma_{110} \end{aligned}\]

Therefore, the composite model can be written as:

\[\begin{aligned} Y_{tij} =\ & \pi_{0ij} + \pi_{1ij}year92_{tij} + \epsilon_{tij} \\ =\ & (\beta_{00j} + \beta_{01j}sector_{ij} + \beta_{02j}unit_{ij} + \beta_{03j}sector_{ij}unit_{ij} + u_{0ij}) + (\beta_{10j} + \beta_{11j}sector_{ij} + u_{1ij})year92_{tij} + \epsilon_{tij} \\ =\ & \left[\gamma_{000} + r_{00j} + (\gamma_{010} + r_{01j})sector_{ij} + (\gamma_{020} + r_{02j})unit_{ij} + (\gamma_{030} + r_{03j})sector_{ij}unit_{ij} + u_{0ij} \right] + \\ & \left[\gamma_{100} + \gamma_{110}sector_{ij} + u_{1ij} \right]year92_{tij} + \epsilon_{tij} \\ =\ & \left[\gamma_{000} + \gamma_{010}sector_{ij} + \gamma_{020}unit_{ij} + \gamma_{030}sector_{ij}unit_{ij} +\gamma_{100}year92_{tij} + \gamma_{110} sector_{ij}year92_{tij} \right] + \\ & \left[r_{00j} + r_{01j}sector_{ij} + r_{02j}unit_{ij} + r_{03j}sector_{ij}unit_{ij} + u_{0ij} + u_{1ij}year92_{tij} + \epsilon_{tij} \right] \end{aligned}\]

Fixed effects

summary(model_f)
##  Groups                       Name        Variance  Std.Dev. Corr  
##  qcaa_district:qcaa_school_id (Intercept) 1.8467382 1.358947       
##                               year92      0.0016073 0.040092 -0.729
##  qcaa_district                (Intercept) 0.2117140 0.460124       
##  Residual                                 0.1210798 0.347965
##                                              Estimate  Std. Error    t value
## (Intercept)                               3.429618908 0.202380825 16.9463629
## year92                                    0.009468131 0.004804577  1.9706483
## sectorGovernment                          0.067490171 0.182073306  0.3706758
## sectorIndependent                        -1.246645798 0.198604067 -6.2770406
## unityear_12_enrolments                   -0.084742947 0.011705675 -7.2394753
## year92:sectorGovernment                  -0.013953478 0.005543237 -2.5172075
## year92:sectorIndependent                  0.026088675 0.006094448  4.2807280
## sectorGovernment:unityear_12_enrolments  -0.028631744 0.013514625 -2.1185748
## sectorIndependent:unityear_12_enrolments -0.009183082 0.015106467 -0.6078908
##  Number of Level Two groups =  466 
##  Number of Level Three groups =  13

Based on the model output (see detailed explanation of fixed effects in step 9), the estimated mean enrolments for government school are expected to decrease by 0.4475% (\((e^{0.00946813 - 0.01395348} - 1) * 100\)) each year, which is 1.3857% (\((e^{- 0.01395348} - 1) * 100\)) less than that of catholic schools. On the other hand, independent schools are predicted to have an increase of 3.6197% (\((e^{0.00946813 + 0.02608867} - 1) * 100\)), which is 2.6432% more than that of catholic schools.

Fixed effects of the final model for Mathematical Methods

Figure 13: Fixed effects of the final model for Mathematical Methods

Based on the fixed effects (Figure 13), independent schools are expected to have the relatively highest increase in average enrolments across the years while government school showed a decrease in enrolments over the years. All sectors shows the year 11 enrolments are expected to be more than year 12 enrolments each year; This may indicate that students may be discontinuing the subject in year 11 after completing year 11 syllabus.

Random effects

Random effects for all schools

Figure 14: Random effects for all schools

Figure 14 represents the random intercept and slope of the random effects for a given school. It is manifest that the intercept and slope are negatively correlated (-0.73, as shown on the model summary output), where a large intercept is associated with a smaller random slope. This suggests that a larger school is associated with a smaller increase (decrease) in enrolments over the years while smaller schools are predicted to have large increase in enrolments over the years.

Random intercept for districts

Figure 15: Random intercept for districts

As the random slopes are removed, all districts are predicted to have the same increase in enrolments over the years; And as was discussed previously, this was a reasonable assumption or an otherwise perfect correlation with random slope and intercept will be fitted. Figure 15 demonstrates that schools in Brisbane Central has the largest enrolments, on average.

Predictions

Model predictions for 20 randomly selected schools

Figure 16: Model predictions for 20 randomly selected schools

Figure 16 above shows the predictions for 20 randomly selected schools.

Specialist Mathematics

Exploring the dataset with basic linear model

Basic linear model for 20 randomly selected schools to provide an at-a-glance visualisation of enrolment trends within schools for Specialist Mathematics subject

Figure 17: Basic linear model for 20 randomly selected schools to provide an at-a-glance visualisation of enrolment trends within schools for Specialist Mathematics subject

With reference to the first step, Figure 17 fits a linear model for enrolments for a random sample of 20 schools. A myriad of patterns can be obtained from this. For instance, enrolments in school 2 (top left) appears to have significant larger increase in enrolments over the years the specialist mathematics have been introduced. Most schools showed relatively small enrolment numbers of less than 50, however school 2 and 175 showed rather large enrolment numbers for each cohort. It also demonstrates that each school can introduce (or end) specialist mathematics at different years – e.g. School 591 only introduced the subject in 2016, and School 113 introduced the subject in 1995 and discontinued the subject in 2005.

Getting the data ready for modelling

Removing zero enrolments

All zero enrolments in a given year will be removed for modelling. As aforementioned, most of the zero enrolments in year 11 (refer to Figure 9) were attributed to the 2007 prep year cohort while zero enrolments in year 12 relates to the first year in which a school introduces the subject. Other zero enrolments mostly relates to smaller schools with little to no enrolments in the subject for a given year. These zero enrolments will be removed for modelling purposes.

Linearise response variable using log transformation

Effects of log transformation for response variable (enrolments) in Specialist Mathematics subject

Figure 18: Effects of log transformation for response variable (enrolments) in Specialist Mathematics subject

As multilevel model assumes normality in the error terms, a log transformation is utilised to allow models to be estimated by the linear mixed models. The log transformation allows enrolment numbers to be approximately normally distributed (Figure 18).

Unconditional means model

Table 9: AIC values for all candidate models for Specialist Mathematics
df AIC
Model0.2: Schools nested within districts 4 25300.73
Model0.1: Schools nested within postcodes 4 25344.29
Model0.0: Within schools 3 25361.47

As per step 3, the three potential models are fitted, with the AIC shown in Table 9. Based on the AIC, model0.2, corresponding the schools nested within districts is the best model and will be used in the subsequent analysis.

Intraclass correlation (\(ICC\))

summary(model0.2)
## Random effects:
##  Groups                       Name        Variance Std.Dev.
##  qcaa_school_id:qcaa_district (Intercept) 0.42131  0.64908 
##  qcaa_district                (Intercept) 0.11159  0.33405 
##  Residual                                 0.27720  0.52649
## 
##  Fixed effects:
##             Estimate Std. Error  t value
## (Intercept) 2.120037 0.09828461 21.57039
## 
##  Number of schools (level-two group) = 422 
##  Number of district (level-three group) = 13

This model will takes into account 422 schools nested in 13 districts. In a three-level multilevel model, two intraclass correlations can be obtained using the model summary output above:

The level-two ICC relates to the correlation between school \(i\) from district \(k\) in time \(t\) and in time \(t^* \ne t\):

\[\text{Level-two ICC} = \frac{\tau^{2}_{00}}{\tau^{2}_{00} + \phi^{2}_{00} + \sigma^2} = \frac{0.4213}{(0.4213 + 0.1116 + 0.2772)} = 0.5201\]

This can be conceptualised as the correlation between enrolments of two random draws from the same school at two different years. In other words, 52.01% of the total variability is attributable to the changes overtime within schools.

The level-three ICC refers to the correlation between different schools \(i\) and \(i^*\) from a specific district \(j\) in time \(t\) and time \(t^* \ne t\).

\[\text{Level-three ICC} = \frac{\phi^2_{00}}{\tau^2_{00} + \phi^2_{00} + \sigma^2} = \frac{0.1116}{(0.4213 + 0.1116 + 0.2772)} = 0.1378\]

Similarly, this can be conceptualised as the correlation between enrolments of two randomly selected schools from the same district – i.e. 11.70% of the total variability is due to the difference between districts.

Unconditional growth model

The unconditional growth model introduces the time predictor at level one, the model specification can be found in step 4. This allows for assessing within-school variability which can be attributed to linear changes over time. Furthermore, variability in intercepts and slopes can be obtained to compare schools within the same districts, and schools from different districts.

summary(model1.0)
##  Groups                       Name        Variance  Std.Dev. Corr  
##  qcaa_district:qcaa_school_id (Intercept) 0.8214386 0.906332       
##                               year92      0.0011262 0.033559 -0.654
##  qcaa_district                (Intercept) 0.0000000 0.000000       
##                               year92      0.0002295 0.015149   NaN 
##  Residual                                 0.2092841 0.457476
##               Estimate Std. Error   t value
## (Intercept) 1.79689374 0.04801983 37.419825
## year92      0.01647784 0.00461389  3.571355
##  Number of Level Two groups =  422 
##  Number of Level Three groups =  13
  • \(\pi_{0ij}\) = 1.7969: Initial status for school \(i\) in district \(j\) (i.e. expected log enrolments when time = 0)
  • \(\pi_{1ij}\) = 0.0165: Growth rate for school \(i\) in district \(j\)
  • \(\epsilon_{tij}\) = 0.2093: Variance in within-school residuals after accounting for linear growth overtime

When the subject was first introduced in 1992, schools were expected to have 6.0309 (\(e^{1.7848}\)) enrolments, on average. This enrolments were rather low as there were only a small fraction of schools that offered the subject in 1992 (as demonstrated in Figure 2). On average, enrolments were expected to increase by 1.6614% (\((e^{0.0164778} - 1)\times100\)) per year. The estimated within-schools variance decreased by 24.45% (0.2772 to 0.2093), implying that 24.5% of within-school variability can be explained by the linear growth over time.

Dealing with boundary constraints

A singular fit is observed in the model as the correlation between the intercept and slope between districts are perfectly correlation (i.e. \(\phi_{01}\) = 1). This may suggest that the model is overfitted – i.e. the random effects structure is too complex to be supported by the data and may require some re-parameterisation. Naturally, the higher-order random effects (e.g. random slope of the third level (between district)) can be removed, especially where the variance and correlation terms are estimated on the boundaries (add bookdown reference).

summary(model1.1)
##  Groups                       Name        Variance  Std.Dev. Corr  
##  qcaa_district:qcaa_school_id (Intercept) 0.8433365 0.918334       
##                               year92      0.0012537 0.035407 -0.679
##  qcaa_district                (Intercept) 0.1201703 0.346656       
##  Residual                                 0.2093102 0.457504
##               Estimate  Std. Error   t value
## (Intercept) 1.75561841 0.107914664 16.268581
## year92      0.01832492 0.001964398  9.328521
##  Number of Level Two groups =  422 
##  Number of Level Three groups =  13

To elaborate, two parameters were removed by setting variance components \(\phi^2_{10}\) = \(\phi_{01}\) equal to zero Which indirectly assumes that the growth rate for district \(j\) to be fixed. As shown in the output above, this produced a more stable model and is free from any boundary constraints. As compared to the unconditional growth model (model1.0), the fixed effects remained rather similar.

Level one and level two will be identical to the unconditional growth model (model1.0), however, the random slope for level 3 will be removed. This implies that the error assumption at level three now follows a univariate normal distribution where \(r_{00j} \sim N(0, \phi_{00}^{2})\).

The new Level three (districts): \[\beta_{00j} = \gamma_{000} + r_{00j} \\ \beta_{10j} = \gamma_{100} \]

And therefore composite model: \[\begin{aligned} Y_{tij} &= \pi_{0ij} + \pi_{1ij}year92_{tij} + \epsilon_{tij} \\ &= (\beta_{00j} + u_{0ij}) + (beta_{10j} + u_{1ij})year92_{tij} + \epsilon_{tij} \\ &= (\gamma_{000} + r_{00j} + u_{0ij}) + (\gamma_{100} + u_{1ij})year92_{tij} + \epsilon_{tij} \\ &= \left[\gamma_{000} + \gamma_{100}year92_{tij} \right] + \left[r_{00j} + u_{0ij} + u_{1ij}year92_{tij} + \epsilon_{tij} \right] \end{aligned}\]

Testing Fixed effects

Table 10: AIC for all possible models with different combinations of fixed effects
model AIC
model4.4 22025.38
model4.5 22027.29
model4.7 22028.11
model4.1 22030.06
model4.0 22033.52
model4.3 22044.92
model4.2 22045.67
model4.8 22045.67
model4.10 22045.67
model4.9 22045.67
model4.6 22047.61

As highlighted in step 6, sector and unit will be added as predictors to the model. The largest possible model will be fitted, before removing fixed effects one by one while recording the AIC for each model. In this case, model4.0 corresponds to the largest possible model while model4.10 is the smallest possible model. The model with the optimal (lowest) AIC is model4.4 (Table 10). The next section will test the selected model’s random effects to build the final model.

Testing random effects with parametric bootstrpa

This step will not be undertaken, as the random slope will not be included at level three as a boundary constraint was found in the unconditional growth model, indicating that the model will be overfitted if random slopes were included at level three.

Table 11: Parametric Bootstrap to compare larger and smaller, nested model
npar AIC BIC logLik deviance Chisq Df Pr_boot(>Chisq)
10 24312.80 24389.11 -12146.40 24292.80 NA NA NA
12 22025.38 22116.96 -11000.69 22001.38 2291.42 2 0

Confidence interval

Table 12: 95% confidence intervals for fixed and random effects in the final model
var 2.5 % 97.5 %
sd_(Intercept)|qcaa_district:qcaa_school_id 0.8475964 0.9886445
cor_year92.(Intercept)|qcaa_district:qcaa_school_id -0.7244002 -0.5994735
sd_year92|qcaa_district:qcaa_school_id 0.0314348 0.0370023
sd_(Intercept)|qcaa_district 0.1917148 0.5116679
sigma 0.4517649 0.4620813
(Intercept) 1.5792349 2.1738440
sectorGovernment -0.2351881 0.2926828
sectorIndependent -0.7884265 -0.1855885
unityear_12_enrolments -0.0559129 -0.0286231
year92 0.0058831 0.0231169
sectorGovernment:year92 -0.0122945 0.0079922
sectorIndependent:year92 0.0074664 0.0310211

The parametric bootstrap is utilised to construct confidence intervals (detailed explanation in step 8) for the random effects. If the confidence intervals between the random effects does not include 0, it provides statistical evidence that the p-value is less than 0.5. In other words, it suggests that the random effects and the correlation between the random effects are significant at the 5% level. The confidence interval for the fixed and random effects all exclude 0 (Table 12), indicating that they’re different from 0 in the population (i.e. statistically significant).

Interpreting the final model

Composite model

  • Level one (measurement variable) \[Y_{tij} = \pi_{0ij} + \pi_{1ij}year92_{tij} + \epsilon_{tij}\]

  • Level two (schools within districts) \[\begin{aligned} \pi_{0ij} &= \beta_{00j} + \beta_{01j}sector_{ij} + \beta_{02j}unit_{ij} + u_{0ij} \\ \pi_{1ij} &= \beta_{10j} + \beta_{11j}sector_{ij} + u_{1ij} \end{aligned}\]

  • Level three (districts) \[\begin{aligned}\beta_{00j} &= \gamma_{000} + r_{00j} \\ \beta_{01j} &= \gamma_{010} + r_{01j} \\ \beta_{02j} &= \gamma_{020} + r_{02j} \\ \beta_{10j} &= \gamma_{100} \\ \beta_{11j} &= \gamma_{110} \end{aligned}\]

Therefore, the composite model can be written as \[\begin{aligned} Y_{tij} &= \pi_{0ij} + \pi_{1ij}year92_{tij} + \epsilon_{tij} \\ &= (\beta_{00j} + \beta_{01j}sector_{ij} + \beta_{02j}unit_{ij} + u_{0ij}) + (\beta_{10j} + \beta_{11j}sector_{ij} + u_{1ij})year92_{tij} + \epsilon_{tij} \\ &= \left[\gamma_{000} + r_{00j} + (\gamma_{010} + r_{01j})sector_{ij} + (\gamma_{020} + r_{02j})unit_{ij} + u_{0ij} \right] + \left[\gamma_{100} + \gamma_{110}sector_{ij} + u_{1ij} \right]year92_{tij} + \epsilon_{tij} \\ &= \left[\gamma_{000} + \gamma_{010}sector_ {ij} + \gamma_{020}unit_{ij} + \gamma_{100}year92_{tij} + \gamma_{110}sector_{ij}year92_{tij} \right] + \left[r_{00j} + r_{01j}sector_{ij} + r_{02j}unit_{ij} + u_{0ij} + u_{1ij}year92_{tij} + + \epsilon_{tij} \right]\end{aligned}\]

Fixed effects

summary(model_f)
##  Groups                       Name        Variance  Std.Dev. Corr  
##  qcaa_district:qcaa_school_id (Intercept) 0.8370257 0.914891       
##                               year92      0.0011733 0.034254 -0.666
##  qcaa_district                (Intercept) 0.1262182 0.355272       
##  Residual                                 0.2087225 0.456862
##                             Estimate  Std. Error    t value
## (Intercept)               1.88540434 0.150880944 12.4959740
## sectorGovernment          0.02044649 0.131787850  0.1551470
## sectorIndependent        -0.48119714 0.148800988 -3.2338303
## unityear_12_enrolments   -0.04230158 0.007419052 -5.7017495
## year92                    0.01448120 0.004293656  3.3726968
## sectorGovernment:year92  -0.00194999 0.005033040 -0.3874377
## sectorIndependent:year92  0.01893009 0.005655238  3.3473547
##  Number of Level Two groups =  422 
##  Number of Level Three groups =  13

Based on the model output (see detailed explanation of fixed effects in step 9), the estimated mean enrolments for government schools are estimated to be 2.06% (\((e^{0.02044} - 1) \times 100\)) more than that of catholic schools when the subject is first introduced in 1992 (i.e. larger initial status). Government schools are estimated to have a mean increase in enrolments of 1.2610% (\((e^{(0.01448 - 0.0019)} - 1) \times 100\)) per year, which is -0.1948% (\((e^(-0.00194999) - 1) * 100)\)) less than that of catholic schools.

On the other hand, independent schools have an estimated mean enrolments of 38% less than that of catholic schools when the subject is first introduced in 1992. However, this low initially low enrolments is matched with an a mean increase of 3.3976% per year, which is 1.9110% more than that of catholic schools per year.

Fixed effects of the final model for Specialist Mathematics subject

Figure 19: Fixed effects of the final model for Specialist Mathematics subject

Based on the fixed effects, independent schools are expected to have highest log enrolments after 2020 (Figure 19. In all sectors, unit 11 enrolments appears to be marginally larger than unit 12, which may imply that students are taking the subject in year 10, so lesser students are enrolled in the subject in year 11. Government sectors appears to have the highest enrolment numbers initially (in year 1992), but have increases at a relatively slower rate as compared to the other sectors.

To be noted, these fixed effects considers all schools within the different sectors; Therefore, the many small government schools with little enrolments in the subject (as seen in Figure 4) may ‘down weight’ the mean fixed effects of all schools.

Random effects

Random effects for all schools

Figure 20: Random effects for all schools

Figure 20 represents the random intercept and slope of the random effects for a given school. It is a manifest that the intercept and slope are negatively correlated, where a large intercept is associated with a smaller random slope. This suggests that a larger school is associated with a smaller increase (decrease) in enrolments over the years while smaller schools are predicted to have large increase in enrolments over the years.

Random intercept for districts

Figure 21: Random intercept for districts

As the random slopes are removed, all districts are predicted to have the same increase in enrolments over the years; And as was discussed previously, this was a reasonable assumption or an otherwise perfect correlation with random slope and intercept will be fitted. Figure 21 demonstrates that schools in Brisbane Central has the largest enrolments, on average.

Predictions

Model predictions for 20 randomly selected schools

Figure 22: Model predictions for 20 randomly selected schools

Figure 22 above shows the predictions for 20 randomly selected schools.

Essential Mathematics

Exploring the dataset with basic linear model for each school

Basic linear model for 20 randomly selected schools to provide an at-a-glance visualisation of enrolment trends within schools for biology subject

Figure 23: Basic linear model for 20 randomly selected schools to provide an at-a-glance visualisation of enrolment trends within schools for biology subject

As stated above, Essential Mathematics is one of the new subject introduced in the QCE system. Thus, there are only observations for three years, spanning from 2020 to 2022, as shown in Figure 23. The various sizes of schools can be seen in the plot, where schools 589, 631, 753 (bottom-right) have relatively low enrolments as compared to school 40 or school 88 (top-right). School 484 introduced the school in 2020, but discontinued offering the subject since.

Getting the data ready for modelling

Linearise response variable using log transformation

The enrolments were right skewed, which is likely to be attributed to the various school sizes (as seen in Figure 23). A log transformation was implemented to the response variable (i.e. enrolments) to allow the the multilevel model to better capture the enrolment patterns.

Unconditional means model

Table 13: AIC values for all candidate models for Essential Mathematics
df AIC
Model0.2: Schools nested within districts 4 3614.649
Model0.0: Within schools 3 3619.822
Model0.1: Schools nested within postcodes 4 3621.822

Referring back to step 3, three candidate models are fitted, with the AIC shown in Table 13. Model0.2, corresponding to having schools nested within districts is the best model, with optimised (lowest) AIC and will be used in the subsequent analysis.

Intraclass correlation (\(ICC\))

## Random effects:
##  Groups                       Name        Variance Std.Dev.
##  qcaa_school_id:qcaa_district (Intercept) 1.148377 1.07162 
##  qcaa_district                (Intercept) 0.047705 0.21841 
##  Residual                                 0.113995 0.33763
## 
##  Fixed effects:
##             Estimate Std. Error  t value
## (Intercept) 3.841442 0.07919555 48.50577
## 
##  Number of schools (level-two group) = 458 
##  Number of district (level-three group) = 13

This model takes into account 458 schools nested in 13 districts. In a three-level multilevel model, two intraclass correlations can be obtained using the model summary output above:

The level-two ICC relates to the correlation between school \(i\) from a certain district \(k\) in time \(t\) and in time \(t^*\):

\[\text{Level-two ICC} = \frac{\tau^{2}_{00}}{\tau^{2}_{00} + \phi^{2}_{00} + \sigma^2} = \frac{1.1484}{(1.1484 + 0.0477 + 0.1140)} = 0.8766\]

This can be conceptualised as the correlation between enrolments of two random draws from the same school at two different years. In other words, 87.66% of the total variability is attributable to the differences between schools from the same district rather than changes over time within schools.

The level-three ICC refers to the correlation between different schools \(i\) and \(i^*\) from a specific school \(j\).

\[\text{Level-three ICC} = \frac{\phi^2_{00}}{\tau^2_{00} + \phi^2_{00} + \sigma^2} = \frac{0.0477}{(1.1484 + 0.0477 + 0.1140)} = 0.0364\]

Similarly, it can be inferred that the correlation between enrolments of two randomly selected schools from different districts are 3.64%, where the total variability can be attributed to the difference between districts.

Unconditional Growth model

##  Groups                       Name        Variance   Std.Dev. Corr  
##  qcaa_district:qcaa_school_id (Intercept) 1.14444636 1.069788       
##                               year20      0.04903558 0.221440 -0.088
##  qcaa_district                (Intercept) 0.04361994 0.208854       
##                               year20      0.00082473 0.028718 0.230 
##  Residual                                 0.07323625 0.270622
##               Estimate Std. Error   t value
## (Intercept) 3.78369587 0.07738087 48.897045
## year20      0.05820918 0.01485465  3.918583
##  Number of Level Two groups =  458 
##  Number of Level Three groups =  13

The unconditional growth model adds the systematic changes over time, the model specification can be found in step 4. This allows for assessing within-school variability which can be attributed to the linear changes over time. Based on the model output:

  • \(\pi_{0ij}\) = 3.7837: Initial status for school \(i\) in district \(j\) (i.e. expected log enrolments when time = 0)
  • \(\pi_{1ij}\) = 0.0582: Growth rate for school \(i\) in district \(j\)
  • \(\epsilon_{tij}\) = 0.0732: Variance in within-school residuals after accounting for linear growth overtime

Essential Mathematics was first introduced in 2020, and schools are expected to have 43.9785 (\(e^{3.7837}\)), on average. Furthermore, enrolments were expected to increase by 5.9937% (\(e^{0.0582092} - 1) \times 100\)) every year. The estimated within-school variance decrease by 35.7577% (0.1140 to 0.0732362), implying that 35.7577% of the within-school variability can be explained by the linear growth over time.

Testing fixed effects

Table 14: AIC for all possible models with different combinations of fixed effects
model npar AIC BIC logLik
model4.0 19 2946.979 3058.468 -1454.489
model4.6 15 2951.546 3039.564 -1460.773
model4.2 14 2951.814 3033.964 -1461.907
model4.8 14 2951.814 3033.964 -1461.907
model4.9 14 2951.814 3033.964 -1461.907
model4.10 14 2951.814 3033.964 -1461.907
model4.1 17 2955.390 3055.143 -1460.695
model4.7 16 2955.666 3049.552 -1461.833
model4.3 13 2981.319 3057.601 -1477.659
model4.4 14 2985.158 3067.308 -1478.579
model4.5 15 2985.186 3073.204 -1477.593

As detailed in step 6, level-two predictors (sector and unit) are added to the model. The largest possible model will be fitted, before removing each fixed effect one by one whilst recording the AIC for each model. model4.0 corresponds to the largest model while model4.10 is the smallest possible model. The model with the optimal (lowest) AIC is the largest possible model model4.0, and will be used in subsequent sections.

Parametric bootstrap to test random effects

Table 15: Parametric Bootstrap to compare larger and smaller, nested model
npar AIC BIC logLik deviance Chisq Df Pr_boot(>Chisq)
17 2944.454 3044.208 -1455.227 2910.454 NA NA NA
19 2946.979 3058.468 -1454.489 2908.979 1.47565 2 0.256
Histogram of likelihood ratio test statistic, with a red vertical line indicating the likelihood ratio test statistic for the actual model

Figure 24: Histogram of likelihood ratio test statistic, with a red vertical line indicating the likelihood ratio test statistic for the actual model

The parametric bootstrap is used to approximate the likelihood ratio test statistic to produce a more accurate p-value by simulating data under the null hypothesis (detailed explanation can be found in step 7. Figure 24 displays the likelihood ratio test statistic from the null distribution, with the red line representing the likelihood ratio test statistic using the actual data.

The p-value of 25.6% (Table 15) indicates the proportion of times in which the bootstrap test statistic is greater than the observed test statistic. The large estimated \(p\)-value is 0.256 < 0.05 fails to reject the null hypothesis at the 5% level, indicating that the smaller model (without random slope at level three) is preferred.

Confidence interval

Table 16: 95% confidence intervals for fixed and random effects in the final model
var 2.5 % 97.5 %
sd_(Intercept)|qcaa_district:qcaa_school_id 0.7489372 0.8615391
cor_year20.(Intercept)|qcaa_district:qcaa_school_id -0.2253224 -0.0015329
sd_year20|qcaa_district:qcaa_school_id 0.2044296 0.2476368
sd_(Intercept)|qcaa_district 0.1525776 0.4495489
sigma 0.2577367 0.2748534
(Intercept) 3.3929124 3.8944034
year20 0.0209839 0.1432844
sectorGovernment 0.5992458 1.0092179
sectorIndependent -1.1713273 -0.7385895
unityear_12_enrolments 0.0544900 0.1923526
year20:sectorGovernment -0.1098924 0.0381604
year20:sectorIndependent -0.0548805 0.1066265
year20:unityear_12_enrolments -0.1250023 -0.0091553
sectorGovernment:unityear_12_enrolments -0.2852633 -0.1176833
sectorIndependent:unityear_12_enrolments -0.0772452 0.1178977
year20:sectorGovernment:unityear_12_enrolments 0.0179056 0.1611691
year20:sectorIndependent:unityear_12_enrolments -0.0740500 0.0683533

The parametric bootstrap is utilised to construct confidence intervals (as detailed in step 8). If the confidence intervals for the random effects does not include 0, it provides statistical evidence that the p-value is less than 0.5. In other words, it suggests that the random effects and the correlation between the random effects are significant at the 5% level. The 95% confidence interval is shown above (Table 16), and the random effects all exclude 0, further reiterating that they are statistically significant at the 5% level.

Interpreting final model

Composite Model

  • Level one (measurement variable) \[Y_{tij} = \pi_{0ij} + \pi_{1ij}year20_{tij} + \epsilon_{tij}\]

  • Level two (schools within districts) \[\begin{aligned} \pi_{0ij} &= \beta_{00j} + \beta_{01j}sector_{ij} + \beta_{02j}unit_{ij} + \beta_{03j}sector_{ij}unit_{ij} + u_{0ij} \\ \pi_{1ij} &= \beta_{10j} + \beta_{11j}sector_{ij} + \beta_{12j}unit_{ij} + \beta_{03j}sector_{ij}unit_{ij} + u_{1ij} \end{aligned}\]

  • Level three (districts) \[\begin{aligned} \beta_{00j} &= \gamma_{000} + r_{00j} \\ \beta_{01j} &= \gamma_{010} + r_{01j} \\ \beta_{02j} &= \gamma_{020} + r_{02j} \\ \beta_{03j} &= \gamma_{030} + r_{03j} \\ \beta_{10j} &= \gamma_{100} \\ \beta_{11j} &= \gamma_{110} \\ \beta_{12j} &= \gamma_{120} \\ \beta_{13j} &= \gamma_{130} \end{aligned}\]

Therefore, the composite model can be written as

\[\begin{aligned} Y_{tij} =\ &\pi_{0ij} + \pi_{1ij}year92_{tij} + \epsilon_{tij} \\ =\ & (\beta_{00j} + \beta_{01j}sector_{ij} + \beta_{02j}unit_{ij} + \beta_{03j}sector_{ij}unit_{ij} + u_{0ij}) + \\ & (\beta_{10j} + \beta_{11j}sector_{ij} + \beta_{12j}unit_{ij} + \beta_{13j}sector_{ij}unit_{ij} + u_{1ij})year20_{tij} + \epsilon_{tij} \\ =\ & \left[\gamma_{000} + r_{00j} + (\gamma_{010} + r_{01j})sector_{ij} + (\gamma_{020} + r_{02j})unit_{ij} + (\gamma_{030} + r_{03j})sector_{ij}unit_{ij} + u_{0ij} \right] + \\ & \left[\gamma_{100} + \gamma_{110}sector_{ij} + \gamma_{120}unit_{ij} + \gamma_{130}sector_{ij}unit_{ij} + u_{1ij} \right]year20_{tij} + \epsilon_{tij} \\ =\ & \left[\gamma_{000} + \gamma_{010}sector_{ij} + \gamma_{020}unit_{ij} + \gamma_{030}sector_{ij}unit_{ij} + \gamma_{100}year20_{tij} + \gamma_{110}sector_{ij}year_{tij} + \gamma_{120}unit_{ij}year_{tij} + \gamma_{130}sector_{ij}unit_{ij}year_{tij} \right] \\ & \left[r_{00j} + r_{01j}sector_{ij} + r_{02j}unit_{ij} + r_{03j}sector_{ij}unit_{ij} + u_{0ij} + u_{1ij}year20_{tij} + \epsilon_{tij} \right] \end{aligned}\]

Fixed effects

Fixed effects of the final model for Essential Mathematics

Figure 25: Fixed effects of the final model for Essential Mathematics

With a three-way interaction, it is easier to visualise the fixed effects (Figure 25). All sectors demonstrated the same trend with units, where unit 11 showed a larger increase in enrolments, on average. This implies that over the years, more students are enrolling in unit 11 than unit 12.

Random effects

Random effects for schools

Figure 26: Random effects for schools

As shown in the model output and in Figure 26, there was little correlation (-0.12) between the random intercept and slope. Schools with appear to decrease or increase in enrolments at different rates; This may be attributed to the fact that the data only consists of observations for three years and there is not enough data to evaluate the enrolment patterns in schools.

Random intercept for districts

Figure 27: Random intercept for districts

As the random slopes are removed, all districts are predicted to have the same increase in enrolments over the years; And as was discussed previously, this was a reasonable assumption or an otherwise perfect correlation with random slope and intercept will be fitted. Figure 27 demonstrates that schools in Gold Coast has the largest enrolments while Mackay have the lowest enrolments in Essential Mathematics subject, on average.

Predictions

Model predictions for 20 randomly selected schools

Figure 28: Model predictions for 20 randomly selected schools

Figure 28 above shows the predictions for 20 randomly selected schools.

General Mathematics

Exploring the dataset with basic linear model

Basic linear model for 20 randomly selected schools to provide an at-a-glance visualisation of enrolment trends within schools for General Mathematics subject

Figure 29: Basic linear model for 20 randomly selected schools to provide an at-a-glance visualisation of enrolment trends within schools for General Mathematics subject

With reference to the first step, Figure 29 fits a linear model for enrolments for a random sample of 20 schools. Various school sizes can be seen, school 633 and 634 (bottom-right) appears to have approximately 10 enrolments per cohort, while school 25 and 34 have enrolments above 200 per cohort. Some schools showed relatively large increase in enrolments over the years, while some showed a decrease (e.g. school 41).

Getting the data ready for modelling

Removing zero enrolments

All zero enrolments in a given year will be removed for modelling. As aforementioned, most of the zero enrolments in year 11 (refer to Figure 9) were attributed to the 2007 prep year cohort while zero enrolments in year 12 relates to the first year in which a school introduces the subject. Other zero enrolments mostly relates to smaller schools with little to no enrolments in the subject for a given year. These zero enrolments will be removed for modelling purposes.

Linearise response variable using log transformation

Effects of log transformation for response variable (enrolments) in General Mathematics

Figure 30: Effects of log transformation for response variable (enrolments) in General Mathematics

The enrolments were right skewed, which is likely to be attributed to the various school sizes (as seen in Figure 30). A log transformation was implemented to the response variable (i.e. enrolments) to allow the the multilevel model to better capture the enrolment patterns.

Unconditional means model

Table 17: AIC values for all candidate models for General Mathematics
df AIC
Model0.2: Schools nested within districts 4 24964.05
Model0.0: Within schools 3 24979.49
Model0.1: Schools nested within postcodes 4 24981.49

As per Step 3, the three potential models are fitted, with the AIC shown in Table 17. Based on the AIC, model0.2, corresponding the schools nested within districts is the best model and will be used in the subsequent analysis.

Intraclass correlation (\(ICC\))

## Random effects:
##  Groups                       Name        Variance Std.Dev.
##  qcaa_school_id:qcaa_district (Intercept) 0.891424 0.94415 
##  qcaa_district                (Intercept) 0.069765 0.26413 
##  Residual                                 0.174806 0.41810
## 
##  Fixed effects:
##             Estimate Std. Error  t value
## (Intercept)  3.68386 0.08530786 43.18313
## 
##  Number of schools (level-two group) = 481 
##  Number of district (level-three group) = 13

This model will takes into account 481 schools nested in 13 districts. In a three-level multilevel model, two intraclass correlations can be obtained using the model summary output above:

The level-two ICC relates to the correlation between school \(i\) from district \(k\) in time \(t\) and in time \(t^* \ne t\):

\[ICC(school) = \frac{\tau^{2}_{00}}{\tau^{2}_{00} + \phi^{2}_{00} + \sigma^2} = \frac{0.89142}{(0.89142 + 0.06976 + 0.17481)} = 0.7847\] This can be conceptualised as the correlation between enrolments of two random draws from the same school at two different years. In other words, 78.47% of the total variability is attributable to the changes overtime within schools.

The level-three ICC refers to the correlation between different schools \(i\) and \(i^*\) from a specific district \(j\) in time \(t\) and time \(t^* \ne t\).

\[ICC(school) = \frac{\phi^2_{00}}{\tau^2_{00} + \phi^2_{00} + \sigma^2} = \frac{0.1077}{(0.5888 + 0.1077 + 0.2094)} = 0.0614\] Similarly, this can be conceptualised as the correlation between enrolments of two randomly selected schools from the same district – i.e. 6.14% of the total variability is due to the difference between districts.

Unconditional Growth model

The unconditional growth model introduces the time predictor at level one, the model specification can be found in step 4. This allows for assessing within-school variability which can be attributed to linear changes over time. Furthermore, variability in intercepts and slopes can be obtained to compare schools within the same districts, and schools from different districts.

##  Groups                       Name        Variance   Std.Dev.  Corr  
##  qcaa_district:qcaa_school_id (Intercept) 3.7608e+00 1.9392756       
##                               year92      3.2360e-03 0.0568856 -0.869
##  qcaa_district                (Intercept) 1.7958e-02 0.1340077       
##                               year92      5.8655e-05 0.0076587 1.000 
##  Residual                                 9.8464e-02 0.3137898
##               Estimate  Std. Error  t value
## (Intercept) 2.91024486 0.098867525 29.43580
## year92      0.03476855 0.003475268 10.00457
##  Number of Level Two groups =  481 
##  Number of Level Three groups =  13
  • \(\pi_{0ij}\) = 2.9102: Initial status for school \(i\) in district \(j\) (i.e. expected log enrolments when time = 0)
  • \(\pi_{1ij}\) = 0.0348: Growth rate for school \(i\) in district \(j\)
  • \(\epsilon_{tij}\) = 0.0984: Variance in within-school residuals after accounting for linear growth overtime

When the subject was first introduced in 1992, schools were expected to have 18.3613 (\(e^{2.9102449}\)) enrolments, on average. On average, enrolments were expected to increase by 3.5380% (\((e^{0.0347686} - 1)\times100\)) per year. The estimated within-schools variance decreased by 90.16% (0.17481 to 0.0984), implying that 90.16% of within-school variability can be explained by the linear growth over time.

Dealing with boundary issues

A singular fit is observed in the model as the correlation between the intercept and slope between districts are perfectly correlation (i.e. \(\phi_{01}\) = 1). This may suggest that the model is overfitted – i.e. the random effects structure is too complex to be supported by the data and may require some re-parameterisation. Naturally, the higher-order random effects (e.g. random slope of the third level (between district)) can be removed, especially where the variance and correlation terms are estimated on the boundaries (add bookdown reference).

##  Groups                       Name        Variance  Std.Dev. Corr  
##  qcaa_district:qcaa_school_id (Intercept) 3.8040956 1.950409       
##                               year92      0.0033141 0.057568 -0.870
##  qcaa_district                (Intercept) 0.1242578 0.352502       
##  Residual                                 0.0984432 0.313757
##              Estimate  Std. Error  t value
## (Intercept) 2.8959781 0.134506478 21.53040
## year92      0.0353323 0.002779433 12.71205
##  Number of Level Two groups =  481 
##  Number of Level Three groups =  13

To elaborate, two parameters were removed by setting variance components \(\phi^2_{10}\) = \(\phi_{01}\) equal to zero Which indirectly assumes that the growth rate for district \(j\) to be fixed. As shown in the model output above, this produced a more stable model and is free from any boundary constraints. As compared to the unconditional growth model (model1.0), the fixed effects remained rather similar.

Level one and level two will be identical to the unconditional growth model (model1.0), however, the random slope for level 3 will be removed. This implies that the error assumption at level three now follows a univariate normal distribution where \(r_{00j} \sim N(0, \phi_{00}^{2})\).

The new Level three (districts): \[\beta_{00j} = \gamma_{000} + r_{00j} \\ \beta_{10j} = \gamma_{100} \]

And therefore composite model: \[\begin{aligned} Y_{tij} &= \pi_{0ij} + \pi_{1ij}year92_{tij} + \epsilon_{tij} \\ &= (\beta_{00j} + u_{0ij}) + (beta_{10j} + u_{1ij})year92_{tij} + \epsilon_{tij} \\ &= (\gamma_{000} + r_{00j} + u_{0ij}) + (\gamma_{100} + u_{1ij})year92_{tij} + \epsilon_{tij} \\ &= \left[\gamma_{000} + \gamma_{100}year92_{tij} \right] + \left[r_{00j} + u_{0ij} + u_{1ij}year92_{tij} + \epsilon_{tij} \right] \end{aligned}\]

Testing fixed effects

Table 18: AIC for all possible models with different combinations of fixed effects
model AIC
model4.0 15044.37
model4.1 15051.92
model4.7 15083.63
model4.5 15133.73
model4.6 15143.43
model4.4 15173.32
model4.9 15175.07
model4.2 15175.07
model4.8 15175.07
model4.10 15175.07
model4.3 15226.17

As highlighted in step 6, sector and unit will be added as predictors to the model. The largest possible model will be fitted, before removing fixed effects one by one while recording the AIC for each model. In this case, model4.0 corresponds to the largest possible model while model4.10 is the smallest possible model. The model with the optimal (lowest) AIC is model4.4 (Table 18). The next section will test the selected model’s random effects to build the final model.

Parametric bootstrap to test random effects

This step will not be undertaken, as the random slope will not be included at level three as a boundary constraint was found in the unconditional growth model, indicating that the model will be overfitted if random slopes were included at level three.

Confidence interval

Table 19: 95% confidence intervals for fixed and random effects in the final model
var 2.5 % 97.5 %
sd_(Intercept)|qcaa_district:qcaa_school_id 1.5242174 1.7454448
cor_year92.(Intercept)|qcaa_district:qcaa_school_id -0.8618092 -0.8004080
sd_year92|qcaa_district:qcaa_school_id 0.0460150 0.0533202
sd_(Intercept)|qcaa_district 0.1979323 0.5344465
sigma 0.3098662 0.3158212
(Intercept) 2.6985826 3.5356257
year92 0.0288684 0.0511286
sectorGovernment 0.2532983 1.0794554
sectorIndependent -2.1037594 -1.1831583
unityear_12_enrolments -0.0161797 0.0802241
year92:sectorGovernment -0.0434640 -0.0178734
year92:sectorIndependent 0.0139084 0.0424983
year92:unityear_12_enrolments -0.0018883 0.0030005
sectorGovernment:unityear_12_enrolments -0.2128654 -0.0980218
sectorIndependent:unityear_12_enrolments -0.0886348 0.0386853
year92:sectorGovernment:unityear_12_enrolments 0.0011711 0.0070114
year92:sectorIndependent:unityear_12_enrolments -0.0021703 0.0038465

The parametric bootstrap is utilised to construct confidence intervals (detailed explanation in step 8) for the random effects. If the confidence intervals between the random effects does not include 0, it provides statistical evidence that the p-value is less than 0.5. In other words, it suggests that the random effects and the correlation between the random effects are significant at the 5% level. The confidence interval for the random effects all exclude 0 (Table 19), indicating that they’re different from 0 in the population (i.e. statistically significant).

Interpreting final model

Composite model

  • Level one (measurement variable) \[Y_{tij} = \pi_{0ij} + \pi_{1ij}year92_{tij} + \epsilon_{tij}\]

  • Level two (schools within districts) will contain new predictor(sector) \[\begin{aligned} \pi_{0ij} &= \beta_{00j} + \beta_{01j}sector_{ij} + \beta_{02j}unit_{ij} + \beta_{03j}sector_{ij}unit_{ij} + u_{0ij} \\ \pi_{1ij} &= \beta_{10j} + \beta_{11j}sector_{ij} + \beta_{12j}unit_{ij} + \beta_{13j}sector_{ij}unit_{ij} + u_{1ij} \end{aligned}\]

  • Level three (districts) \[\begin{aligned}\beta_{00j} &= \gamma_{000} + r_{00j} \\ \beta_{01j} &= \gamma_{010} + r_{01j} \\ \beta_{02j} &= \gamma_{020} + r_{02j} \\ \beta_{03j} &= \gamma_{030} + r_{03j} \\ \beta_{10j} &= \gamma_{100} \\ \beta_{11j} &= \gamma_{110} \\ \beta_{12j} &= \gamma_{120} \\ \beta_{13j} &= \gamma_{130}\end{aligned}\]

The composite model can therefore be written as: \[\begin{aligned} Y_{tij} =\ & \pi_{0ij} + \pi_{1ij}year92_{tij} + \epsilon_{tij} \\ =\ & (\beta_{00j} + \beta_{01j}sector_{ij} + \beta_{02j}unit_{ij} + \beta_{03j}sector_{ij}unit_{ij} + u_{0ij}) + (\beta_{10j} + \beta_{11j}sector_{ij} + \beta_{12j}unit_{ij} + \beta_{13j}sector_{ij}unit_{ij} + u_{1ij})year92_{tij} + \epsilon_{tij} \\ =\ & \left[\gamma_{000} + r_{00j} + (\gamma_{010} + r_{01j})sector_{ij} + (\gamma_{020} + r_{02j})unit_{ij} + (\gamma_{030} + r_{03j})sector_{ij}unit_{ij} + u_{0ij} \right] + \\ & \left[\gamma_{100} + \gamma_{110}sector_{ij} + \gamma_{120}unit_{ij} + \gamma_{130}sector_{ij}unit_{ij} + u_{1ij}\right]year92_{tij} + \epsilon_{tij} \\ =\ & \left[\gamma_{000} + \gamma_{010}sector_{ij} + \gamma_{020}unit_{ij} + \gamma_{030}sector_{ij}unit_{ij} + \gamma_{100}year92_{tij} + \gamma_{110}year92_{tij}sector_{ij} + \gamma_{120}unit_{ij}year92_{tij} + \gamma_{130}sector_{ij}unit_{ij}year92_{tij} \right] + \\ & \left[r_{00j} + r_{01j}sector_{ij} + r_{02j}unit_{ij} + r_{03j}sector_{ij}unit_{ij} + u_{0ij]} + u_{1ij}year92_{tij} + \epsilon_{tij} \right] \end{aligned}\]

Fixed effects

Fixed effects of the final model for General Mathematics subject

Figure 31: Fixed effects of the final model for General Mathematics subject

With a three-way interaction, it is easier to visualise the fixed effects (Figure 31). When the subject was first introduced in 1992, independent schools appears to have the least enrolments, on average. This low enrolment number was matched with a relatively large increase in enrolments per year, as seen by the slope. Although government schools have high enrolments initially, the rate of change in enrolments over the years increased relatively slow compared to the other sectors. Year 11 and year 12 units have similar enrolment numbers, on average.

Random effects

Random effects for all schools

Figure 32: Random effects for all schools

A large negative correlation between the random intercept and slope at the school level is apparent (Figure 32. This suggests that a larger school is associated with a smaller increase (decrease) in enrolments over the years while smaller schools are predicted to have large increase in enrolments over the years.

Random intercept for districts

Figure 33: Random intercept for districts

As the random slopes are removed, all districts are predicted to have the same increase in enrolments over the years; And as was discussed previously, this was a reasonable assumption or an otherwise perfect correlation with random slope and intercept will be fitted. Figure 33 demonstrates that schools in Gold Coast has the largest enrolments, on average.

Predictions

Model predictions for year 11 enrolments for 20 randomly selected schools

Figure 34: Model predictions for year 11 enrolments for 20 randomly selected schools

Figure 34 above shows the predictions for 20 randomly selected schools.

Predictions for Science Subjects

Agricultural Practices

Exploring the dataset with basic linear model for each school

Basic linear model for 20 randomly selected schools to provide an at-a-glance visualisation of enrolment trends within schools for Agricultural Practices

Figure 35: Basic linear model for 20 randomly selected schools to provide an at-a-glance visualisation of enrolment trends within schools for Agricultural Practices

As described in step 1, a basic linear model was plotted for each school to provide insights of the enrolment trends for each school. Figure 35 demonstrates a distinct feature, in which there was a halt in the subject from 2000 to 2014. Schools 272 and 278 and 304 appers to re-introduce subject again after the halt, while some schools introduced the subject in 2015. Some enrolment patterns are rather erratic, such as having only one cohort taking the subject (school 613 and 636) or having a start increase in enrolment in one particular year (school 698).

Getting the data ready for modelling

Removing enrolments before year 2000 and graduating cohort 2019

Time plot of enrolments, demonstrating a halt in the subject from 2000 to 2014

Figure 36: Time plot of enrolments, demonstrating a halt in the subject from 2000 to 2014

A halt in Agricultural practices subject in 2000, before being re-introduced again in 2015 (Figure 36. Enrolments before year 2000 were rather small, as compared to enrolments from 2015 onwards. Therefore, enrolments before year 2000 will be removed.

As aforementioned, most of the zero enrolments in year 11 (refer to Figure 9) were attributed to the 2007 prep year cohort while zero enrolments in year 12 relates to the first year in which a school introduces the subject. Other zero enrolments mostly relates to smaller schools with little to no enrolments in the subject for a given year. For these reasons, all completion years with zero enrolments will also be removed for modelling.

Linearise response variable using log transformation

Effects of log transformation for response variable (enrolments) in Agricultural Practices subject

Figure 37: Effects of log transformation for response variable (enrolments) in Agricultural Practices subject

As multilevel model assumes normality in the error terms, a log transformation is utilised to allow models to be estimated by the linear mixed models. The log transformation allows enrolment numbers to be approximately normally distributed (Figure 37).

Unconditional means model

Table 20: AIC values for all candidate models for Agricultural Practices
df AIC
Model0.0: Within schools 3 1573.659
Model0.2: Schools nested within districts 4 1574.922
Model0.1: Schools nested within postcodes 4 1575.659

As underlined in step 3, the three candidate models are fitted and their AIC is shown in Table 20. Based on the AIC, the two-level model (model0.0) is the best model and will be used in the subsequent analysis.

Intraclass correlation (\(ICC\))

summary(model0.0)
## Random effects:
##  Groups         Name        Variance Std.Dev.
##  qcaa_school_id (Intercept) 0.48659  0.69756 
##  Residual                   0.39251  0.62651
## 
##  Fixed effects:
##             Estimate Std. Error  t value
## (Intercept) 2.647161 0.07777087 34.03796
## 
##  Number of schools (level-two group) = 93 
##  Number of district (level-three group) = NA

The level-two ICC is the correlation between a school \(i\) in time \(t\) and time \(t^*\):

\[ \text{Level-two ICC} = \frac{\tau^{2}_{00}}{\tau^{2}_{00} + \phi^{2}_{00} + \sigma^2} = \frac{0.4866}{(0.4866 + 0.3925)} = 0.5535\]

This can be conceptualised as the correlation between the enrolments of a selected school at two randomly drawn year (i.e. two randomly selected cohort from the same school). In other words, 55.35% of the total variability is attributable to the differences in enrolments within schools at different time periods.

Unconditional growth model

summary(model1.0)
##  Groups         Name        Variance Std.Dev. Corr  
##  qcaa_school_id (Intercept) 0.550453 0.741925       
##                 year15      0.006227 0.078912 -0.402
##  Residual                   0.257708 0.507650
##              Estimate Std. Error  t value
## (Intercept) 2.0127475 0.09392102 21.43021
## year15      0.1514085 0.01398980 10.82278
##  Number of Level Two groups =  93 
##  Number of Level Three groups =  NA

The next step involves incorporating the linear growth of time into the model. The model output is shown above.

  • \(\pi_{0ij}\) = 2.0128: Initial status for school \(i\) in district \(j\) (i.e. expected log enrolments when time = 0)
  • \(\pi_{1ij}\) = 0.1514: Growth rate for school \(i\) in district \(j\)
  • \(\epsilon_{tij}\) = 0.2577: Variance in within-school residuals after accounting for linear growth overtime

When the subject was re-introduced into the QCE system in 2015, schools were expected to have 7.4842 (\(e^{2.0128}\)) enrolment. On average, enrolments were expected to increase by a staggering 16.35% (\((e^{0.151408} - 1)\times100\)) per year. The estimated within-schools variance decreased by 34.34% (0.3925 to 0.2577), implying that 34.34% of within-school variability can be explained by the linear growth over time.

Testing fixed effects

Table 21: AIC for all possible models with different combinations of fixed effects
model npar AIC BIC logLik
model4.5 12 1327.094 1381.961 -651.5468
model4.4 11 1328.231 1378.526 -653.1157
model4.3 10 1329.521 1375.244 -654.7606
model4.1 14 1330.494 1394.506 -651.2470
model4.7 13 1331.572 1391.012 -652.7862
model4.6 12 1333.059 1387.926 -654.5293
model4.0 16 1333.719 1406.875 -650.8593
model4.9 11 1334.117 1384.412 -656.0585
model4.2 11 1334.117 1384.412 -656.0585
model4.8 11 1334.117 1384.412 -656.0585
model4.10 11 1334.117 1384.412 -656.0585

As outlined in step 6, sector and unit will be added as predictors to the model. The largest possible model (model4.0) will then be fitted, before removing the fixed effects one at a time (with model4.10 being the smallest of all 10 candidate models), while recording the AIC for each model. model4.5 appears to have the optimal (smallest) AIC (Table 21), and will be used in the next section to build the final model.

Parametric bootstrap to test random effects

Table 22: Parametric Bootstrap to compare larger and smaller, nested model
npar AIC BIC logLik deviance Chisq Df Pr_boot(>Chisq)
10 1330.958 1376.681 -655.4792 1310.958 NA NA NA
12 1327.094 1381.961 -651.5468 1303.094 7.864871 2 0.009
Histogram of likelihood ratio test statistic, with a red vertical line indicating the likelihood ratio test statistic for the actual model

Figure 38: Histogram of likelihood ratio test statistic, with a red vertical line indicating the likelihood ratio test statistic for the actual model

The parametric bootstrap is used to approximate the likelihood ratio test statistic to produce a more accurate p-value by simulating data under the null hypothesis (detailed explanation can be found in step 7. Figure 38 displays the likelihood ratio test statistic from the null distribution, with the red line indicates the likelihood ratio test statistic using the actual data.

The p-value indicates the proportion of times in which the bootstrap test statistic is greater than the observed test statistic. The low estimated \(p\)-value is 0.009 < 0.05 (Table 22) rejects the null hypothesis at the 5% level, indicating that the larger model (including random slope at level two) is preferred.

Confidence interval

Table 23: 95% confidence interval for the fixed and random effects
var 2.5 % 97.5 %
sd_(Intercept)|qcaa_school_id 0.5410851 0.8275173
cor_year15.(Intercept)|qcaa_school_id -0.6286851 0.1177694
sd_year15|qcaa_school_id 0.0360913 0.0971561
sigma 0.4719375 0.5290075
(Intercept) 1.5118458 2.6445580
year15 -0.0251663 0.1676654
sectorGovernment -0.3715192 0.7707218
sectorIndependent -1.3320193 0.2141525
unityear_12_enrolments -0.4283373 -0.1388208
year15:sectorGovernment -0.0312332 0.1723089
year15:sectorIndependent 0.0272586 0.2881371
year15:unityear_12_enrolments -0.0002587 0.0607488

The parametric bootstrap is utilised to construct confidence intervals (detailed explanation in step 8) for the random effects. If the confidence intervals between the random effects does not include 0, it provides statistical evidence that the p-value is less than 0.5. In other words, it suggests that the random effects and the correlation between the random effects are significant at the 5% level. The confidence interval for the random effects all exclude 0 (Table 23), indicating that they’re different from 0 in the population (i.e. statistically significant).

Interpreting the final model

Composite model

  • Level one (measurement variable) \[Y_{tij} = \pi_{0ij} + \pi_{1ij}year15_{tij} + \epsilon_{tij}\]

  • Level two (schools within postcodes) \[\begin{aligned} \pi_{0ij} &= \beta_{00j} + \beta_{01}sector_{ij} + \beta_{02}unit_{ij} + u_{0ij} \\ \pi_{1ij} &= \beta_{10j} + \beta_{11j}sector_{ij} + \beta_{12j}unit_{ij} + u_{1ij} \end{aligned}\]

Therefore, the composite model can be written as

\[\begin{aligned}Y_{tij} =\ & \pi_{0ij} + \pi_{1ij}year15_{tij} + \epsilon_{tij} \\ =\ & (\beta_{00j} + \beta_{01}sector_{ij} + \beta_{02}unit_{ij} + u_{0ij}) + (\beta_{10j} + \beta_{11j}sector_{ij} + \beta_{12j}unit_{ij} + u_{1ij})year15_{tij} + \epsilon_{tij} \\ =\ & \left[\beta_{00j} + \beta_{01}sector_{ij} + \beta_{02}unit_{ij} + \beta_{10j}year15_{tij} + \beta_{11j}sector_{ij}year15_{tij} + \beta_{12j}unit_{ij}year15_{tij} \right] \left[u_{0ij} + u_{1ij} + \epsilon_{tij} \right] \end{aligned}\]

Fixed effects

summary(model_f)
##  Groups         Name        Variance  Std.Dev. Corr  
##  qcaa_school_id (Intercept) 0.4842622 0.695890       
##                 year15      0.0048578 0.069698 -0.337
##  Residual                   0.2509081 0.500907
##                                  Estimate Std. Error   t value
## (Intercept)                    2.03638008 0.29670877  6.863228
## year15                         0.06877268 0.04919228  1.398038
## sectorGovernment               0.22111834 0.31239875  0.707808
## sectorIndependent             -0.53317803 0.37319648 -1.428679
## unityear_12_enrolments        -0.27868693 0.07555397 -3.688581
## year15:sectorGovernment        0.06990061 0.05093108  1.372455
## year15:sectorIndependent       0.14926780 0.06131800  2.434323
## year15:unityear_12_enrolments  0.02866851 0.01629460  1.759388
##  Number of Level Two groups =  93 
##  Number of Level Three groups =  NA

Based on the model output, the estimated mean enrolments for government schools are estimated to be 24.75% (\(e^{0.2211} - 1 \times 100\)) more than that of catholic schools when the subject was re-introduced in 2015. Government schools are also estimated to have an estimated increase in mean enrolments of 14.87% (\((e^{0.0687727 + 0.0699006} - 1) \times 100\)) per year, which is 7.24% more than that of catholic schools.

On the other hand, independent schools are estimated to have mean enrolments of 41.32% (\((e^{-0.5331780} - 1) \times 100\)) less than that of catholic schools in 2015. This low initial status is matched with a relatively large increase of 24.36% in mean enrolments per year, which is a staggering 16.10% more than that of catholic schools, on average.

Fixed effects of the final model for Agricultural Practices subject

Figure 39: Fixed effects of the final model for Agricultural Practices subject

Based on the fixed effects (see step 9 for detailed explanation), independent schools show the greatest rate of change (as indicated by the slope in Figure 39), indicating that enrolment numbers are increasing at the highest rate as compare to all other sectors. For all sectors, unit 12 appears to have a larger slope than that of unit 11, indicating that there are increasingly more unit 12 enrolments than unit 11. It appears that after 2025, unit 12 enrolments are generally more than that of unit 11. Catholic showed a rather slow increase in enrolments relatively to the other sectors.

Random effects

Random effects for all schools

Figure 40: Random effects for all schools

Figure 40 represents the random intercept and slope of the random effects for all 93 schools who offered Agricultural Practices subject. There are no distinct correlations between the random intercept and slope.

Predictions

Model predictions for year 11 enrolments for 20 randomly selected schools

Figure 41: Model predictions for year 11 enrolments for 20 randomly selected schools

Figure 41 above shows the predictions for 20 randomly selected schools.

Agricultural Science

Exploring the dataset with basic linear model

Basic linear model for 20 randomly selected schools to provide an at-a-glance visualisation of enrolment trends within schools for Agricultural Science subject

Figure 42: Basic linear model for 20 randomly selected schools to provide an at-a-glance visualisation of enrolment trends within schools for Agricultural Science subject

Figure 42 fits a linear model for 20 randomly selected schools. In most of these selected schools, the linear model captured a downward trend in enrolments. Some schools showed a increase, and in particular, school 288 showed a significant increase in enrolments as compared to the other selected schools. It seems that most schools that ceased offering the subject in the later years (e.g. schools 365, 490, 567) showed a large downward trend before ceasing the subject. Interestingly, school 508 displayed a rather cubic trend, where enrolments increases exponentially from 2002 before decrease till 2015 and picking up again.

Getting the data ready for modelling

Removing zero enrolments

All zero enrolments in a given year will be removed for modelling. As aforementioned, most of the zero enrolments in year 11 (refer to Figure 9) were attributed to the 2007 prep year cohort while zero enrolments in year 12 relates to the first year in which a school introduces the subject. Other zero enrolments mostly relates to smaller schools with little to no enrolments in the subject for a given year. These zero enrolments will be removed for modelling purposes.

Linearise response variable using log transformation

Effects of log transformation for response variable (enrolments) in Agricultural Science subject

Figure 43: Effects of log transformation for response variable (enrolments) in Agricultural Science subject

The enrolments were right skewed, which is likely to be attributed to the various school sizes (as seen in Figure 43). A log transformation was implemented to the response variable (i.e. enrolments) to allow the the multilevel model to better capture the enrolment patterns.

Unconditional means model

Table 24: AIC values for all candidate models for Agricultural Science
df AIC
Model0.1: Schools nested within postcodes 4 5146.230
Model0.0: Within schools 3 5149.113
Model0.2: Schools nested within districts 4 5149.856

As outlined in step 3, the three candidate models are fitted and their AIC is shown in Table 24. Based on the AIC, the two-level model (model0.1) corresponding to schools nested within postcodes is the superior model and will be used in the subsequent analysis.

Intraclass correlation (\(ICC\))

summary(model0.1)
## Random effects:
##  Groups                         Name        Variance Std.Dev.
##  qcaa_school_id:school_postcode (Intercept) 0.40515  0.63651 
##  school_postcode                (Intercept) 0.12275  0.35036 
##  Residual                                   0.29180  0.54018
## 
##  Fixed effects:
##             Estimate Std. Error  t value
## (Intercept) 2.018581 0.07739906 26.08017
## 
##  Number of schools (level-two group) = 112 
##  Number of school postcodes (level-three group) = 81

This model will takes into account 112 schools nested in 81 postcodes. In a three-level multilevel model, two intraclass correlations can be obtained using the model summary output above:

The level-two ICC relates to the correlation between school \(i\) from a certain postcode \(k\) in time \(t\) and in time \(t^* \ne t\):

\[\text{Level-two ICC} = \frac{\tau^{2}_{00}}{\tau^{2}_{00} + \phi^{2}_{00} + \sigma^2} = \frac{0.4052}{(0.4052 + 0.1228 + 0.2918)} = 0.4943\]

This can be conceptualised as the correlation between enrolments of two random draws from the same school at two different years. In other words, 49.45% of the total variability is attributable to the changes over time within schools.

The level-three ICC refers to the correlation between different schools \(i\) and \(i^*\) from a specific postcode \(j\).

\[\text{Level-three ICC} = \frac{\phi^2_{00}}{\tau^2_{00} + \phi^2_{00} + \sigma^2} = \frac{0.1228}{(0.4052 + 0.1228 + 0.2918)} = 0.1498\]

Likewise, this can be loosely translated to be the correlation between enrolments of two random draws from two schools from two different postcode. In this case, 14.98% of the total variability is due to the difference between postcodes.

Unconditional growth model

The unconditional growth model introduces the time predictor at level one, the model specification can be found in step 4. This allows for assessing within-school variability which can be attributed to linear changes over time. Furthermore, variability in intercepts and slopes can be obtained to compare schools within the same postcodes, and schools from different postcodes.

summary(model1.0)
##  Groups                         Name        Variance  Std.Dev. Corr  
##  school_postcode:qcaa_school_id (Intercept) 1.0049220 1.002458       
##                                 year94      0.0013985 0.037397 -0.906
##  school_postcode                (Intercept) 0.1565287 0.395637       
##                                 year94      0.0002277 0.015090 -0.669
##  Residual                                   0.2539595 0.503944
##               Estimate  Std. Error   t value
## (Intercept) 1.83876433 0.119841404 15.343314
## year94      0.01227337 0.004831677  2.540188
##  Number of Level Two groups =  112 
##  Number of Level Three groups =  81
  • \(\pi_{0ij}\) = 1.8388: Initial status for school \(i\) in postcode \(j\) (i.e. expected log enrolments when time = 0)
  • \(\pi_{1ij}\) = 0.0123: Growth rate for school \(i\) in postcode \(j\)
  • \(\epsilon_{tij}\) = 0.2540: Variance in within-school residuals after accounting for linear growth overtime

When the subject was first introduced in 1994, schools were expected to have 6.2888 (\(e^{1.8388}\)) enrolments, on average. As displayed in Figure 3, the average enrolments for agricultural science is relatively low, where the enrolments during the old QCE system had approximately 10 enrolments while the new QCE system had roughly 15 enrolments per school, on average.

Enrolments were expected to increase by 1.2349% (\((e^{0.01227} - 1)\times100\)) per year. The estimated within-schools variance decreased by 12.968% (0.2918 to 0.2539595), implying that 12.968% of within-school variability can be explained by the linear growth over time.

Testing fixed effects

Table 25: AIC for all possible models with different combinations of fixed effects
model npar AIC BIC logLik
model4.3 13 4841.696 4919.622 -2407.848
model4.6 15 4841.897 4931.812 -2405.949
model4.5 15 4843.214 4933.129 -2406.607
model4.1 17 4843.639 4945.542 -2404.819
model4.0 19 4847.303 4961.195 -2404.651
model4.10 14 4848.524 4932.444 -2410.262
model4.9 14 4848.524 4932.444 -2410.262
model4.2 14 4848.524 4932.444 -2410.262
model4.8 14 4848.524 4932.444 -2410.262
model4.4 14 4850.301 4934.221 -2411.151
model4.7 16 4850.361 4946.270 -2409.181

As outlined in step 6, sector and unit will be added as predictors to the model. The largest possible model (model4.0) will then be fitted, before removing the fixed effects one at a time (with model4.10 being the smallest of all 10 candidate models), while recording the AIC for each model. model4.3 appears to have the optimal (smallest) AIC (Table 25), and will be used in the next section to build the final model.

Parametric bootstrap to test random effects

Table 26: Parametric Bootstrap to compare larger and smaller, nested model
npar AIC BIC logLik deviance Chisq Df Pr_boot(>Chisq)
11 4837.855 4903.793 -2407.928 4815.855 NA NA NA
13 4841.696 4919.622 -2407.848 4815.696 0.1593719 2 0.718
Histogram of likelihood ratio test statistic, with a red vertical line indicating the likelihood ratio test statistic for the actual model

Figure 44: Histogram of likelihood ratio test statistic, with a red vertical line indicating the likelihood ratio test statistic for the actual model

The parametric bootstrap is used to approximate the likelihood ratio test statistic to produce a more accurate p-value by simulating data under the null hypothesis (detailed explanation can be found in step 7. Figure 44 displays the likelihood ratio test statistic from the null distribution, with the red line indicates the likelihood ratio test statistic using the actual data.

The p-value of 0.718% (Table 26) indicates the proportion of times in which the bootstrap test statistic is greater than the observed test statistic. The large estimated \(p\)-value is 0.718 < 0.05 fails to reject the null hypothesis at the 5% level, indicating that the smaller model (without random slope at level three) is preferred.

Confidence interval

Table 27: 95% confidence intervals for fixed and random effects in the final model
var 2.5 % 97.5 %
sd_(Intercept)|school_postcode:qcaa_school_id 0.8530526 1.2182627
cor_year94.(Intercept)|school_postcode:qcaa_school_id -0.9690616 -0.8565112
sd_year94|school_postcode:qcaa_school_id 0.0315056 0.0460868
sd_(Intercept)|school_postcode 0.0000009 0.4581153
sigma 0.4876907 0.5140716
(Intercept) 1.4227924 2.2504155
year94 0.0005497 0.0184106
sectorGovernment -0.3117705 0.4153108
sectorIndependent -0.0628937 0.7950850
unityear_12_enrolments -0.2747165 -0.1289871
year94:unityear_12_enrolments 0.0026651 0.0107254

The parametric bootstrap is utilised to construct confidence intervals (as detailed in step 8). If the confidence intervals for the random effects does not include 0, it provides statistical evidence that the p-value is less than 0.5. In other words, it suggests that the random effects and the correlation between the random effects are significant at the 5% level.

The 95% confidence interval is shown above, and the random effects all exclude 0, further reiterating that they are statistically significant at the 5% level. Some fixed effects such as unityear_12_enrolments were insignificant, suggesting that there were no differences between unit 11 and unit 12 units.

Interpreting final model

Composite model

  • Level one (measurement variable) \[Y_{tij} = \pi_{0ij} + \pi_{1ij}year94_{tij} + \epsilon_{tij}\]

  • Level two (schools within postcodes) \[\begin{aligned} \pi_{0ij} &= \beta_{00j} + \beta_{01j}sector_{ij} + \beta_{02j}unit_{ij} + u_{0ij} \\ \pi_{1ij} &= \beta_{10j} + \beta_{11j}unit_{ij} + u_{1ij} \end{aligned}\]

  • Level three (postcodes) \[\begin{aligned}\beta_{00j} &= \gamma_{000} + r_{00j} \\ \beta_{01j} &= \gamma_{010} + r_{01j} \\ \beta_{02j} &= \gamma_{020} + r_{02j} \\ \beta_{10j} &= \gamma_{100} \\ \beta_{11j} &= \gamma_{110} \end{aligned}\]

Therefore, the composite model can be written as:

\[\begin{aligned} Y_{tij} =\ & \pi_{0ij} + \pi_{1ij}year94_{tij} + \epsilon_{tij} \\ =\ & (\beta_{00j} + \beta_{01j}sector_{ij} + \beta_{02j}unit_{ij} + u_{0ij}) + (\beta_{10j} + \beta_{11j}unit_{ij} + u_{1ij})year94_{tij} + \epsilon_{tij} \\ =\ & \left[\gamma_{000} + r_{00j} + (\gamma_{010} + r_{01j})sector_{ij} + (\gamma_{020} + r_{02j})unit_{ij} + u_{0ij} \right] + \left[\gamma_{100} + \gamma_{110}unit_{ij} + u_{1ij}\right]year94_{tij} + \epsilon_{tij} \\ =\ & \left[\gamma_{0000} + \gamma_{010}sector_{ij} + \gamma_{020}unit_{ij} + \gamma_{100}year94_{tij} + \gamma_{110}unit_{ij}year94_{tij} \right] + \left[r_{00j} + r_{01j}sector_{ij} + r_{02j}unit_{ij} + u_{0ij} + u_{1ij}year94_{tij} + \epsilon_{tij} \right] \end{aligned}\] ### Fixed effects

summary(model_f)
##  Groups                         Name        Variance  Std.Dev. Corr  
##  school_postcode:qcaa_school_id (Intercept) 1.0717490 1.035253       
##                                 year94      0.0015528 0.039406 -0.916
##  school_postcode                (Intercept) 0.0812513 0.285046       
##  Residual                                   0.2507071 0.500707
##                                   Estimate  Std. Error    t value
## (Intercept)                    1.821947999 0.198742787  9.1673667
## year94                         0.009460629 0.004666531  2.0273366
## sectorGovernment               0.048533174 0.176743531  0.2745966
## sectorIndependent              0.364311290 0.196685041  1.8522572
## unityear_12_enrolments        -0.204347826 0.037573894 -5.4385587
## year94:unityear_12_enrolments  0.006683023 0.002232326  2.9937484
##  Number of Level Two groups =  112 
##  Number of Level Three groups =  81

Notably, the model assumes that enrolments in different postcodes are assumed to increase at the same rate (as justified by the parametric bootstrap while testing for random effects). Using the model output above (see step 9 for detailed explanation on fixed effects), the estimated increase in mean enrolments for schools between postcodes are estimated to increase at a rate of 0.9506% per year.

Fixed effects of the final model for Agricultural Science

Figure 45: Fixed effects of the final model for Agricultural Science

Based on the fixed effects, independent schools are expected to have the highest enrolments over the years. In all sectors, unit 12 have lower initial status (i.e. lower enrolments when subject is first introduced), however, they are expected to increase at a much higher rate relative to unit 11 units. Intuitively, this suggests that on average, there may be more students taking this subject in year 10, before re-enrolling again in year 12, or in the more extreme case, students are dropping the subject after completing year 11.

Random effects

A clear negative correlation in the random intercept and slope can be distinguished, indicating that schools with larger enrolments when the subject was first introduced are expected to show a smaller increase (decrease) in enrolments over the years. Based on the model output, the correlation between random intercept and slope was -0.92.

Random intercept for districts

Figure 46: Random intercept for districts

As there is no random slope, enrolments for the across all postcodes are estimated to be the same (justified by the parametric bootstrap). As shown in Figure 46, most of the variations are accounted for in the random intercept, which suggests that some postcodes are associated with larger schools (and enrolments) relative to other postcodes. Schools within postcode 4350 (In Toowoomba district) are predicted to have the highest initial status while schools within postcode 4570 (Wide Bay district) are predicted to have the lowest initial status.

Predictions

Model predictions for 20 randomly selected schools

Figure 47: Model predictions for 20 randomly selected schools

Aquatic Practices

Exploring the dataset with basic linear model for each school

Basic linear model for 20 randomly selected schools to provide an at-a-glance visualisation of enrolment trends within schools for Aquatic Practices

Figure 48: Basic linear model for 20 randomly selected schools to provide an at-a-glance visualisation of enrolment trends within schools for Aquatic Practices

As shown previously (Figure 3), Aquatic Practices is a relatively new subject introduced in 2015. The plot above (Figure 48) displays a linear model fitted for each school (explained in step 1), primarily to provide an at-a-glance visualisation to the enrolment trends in each school.

In the randomly selected schools, the various school sizes is distinct, where there were relatively larger schools such as school 72 and 442, which showed enrolments of over 100 students while schools 570 and 579 consistently had enrolments below 50 for each year. Some schools showed a stark increase in enrolments (e.g. school 72 and 442), while some school showed rather constant growth enrolments (e.g. School 493 and 517).

Getting the data ready for modelling

Removing graduating cohort 2019

As aforementioned, most of the zero enrolments in year 11 (refer to Figure 9) were attributed to the 2007 prep year cohort while zero enrolments in year 12 relates to the first year in which a school introduces the subject. Other zero enrolments mostly relates to smaller schools with little to no enrolments in the subject for a given year. For these reasons, all completion years with zero enrolments will also be removed for modelling.

Linearise response variable using log transformation

Effects of log transformation for response variable (enrolments) in Aquatic Practices

Figure 49: Effects of log transformation for response variable (enrolments) in Aquatic Practices

As multilevel model assumes normality in the error terms, a log transformation is utilised to allow models to be estimated by the linear mixed models. The log transformation allows enrolment numbers to be approximately normally distributed (Figure 49).

Unconditional means model

Table 28: AIC values for all candidate models for Aquatic Practices
df AIC
Model0.0: Within schools 3 2505.504
Model0.2: Schools nested within districts 4 2506.955
Model0.1: Schools nested within postcodes 4 2507.105

As underlined in step 3, the three candidate models are fitted and their AIC is shown in Table 28. Based on the AIC, the two-level model (model0.0) is the best model and will be used in the subsequent analysis.

Intraclass correlation (\(ICC\))

summary(model0.0)
## Random effects:
##  Groups         Name        Variance Std.Dev.
##  qcaa_school_id (Intercept) 0.69382  0.83296 
##  Residual                   0.38336  0.61916
## 
##  Fixed effects:
##             Estimate Std. Error t value
## (Intercept) 3.308375 0.07911121 41.8193
## 
##  Number of schools (level-two group) = 120 
##  Number of district (level-three group) = NA

This model takes into account 120 schools. For a two-level multilevel model, the level two intraclass correlation coefficient (ICC) can be computed using the model output above.

The level-two ICC is the correlation between a school \(i\) in time \(t\) and time \(t^*\):

\[\text{Level-two ICC} = \frac{\tau^{2}_{00}}{\tau^{2}_{00} + \phi^{2}_{00} + \sigma^2} = \frac{0.6938}{(0.6938 + 0.3834)} = 0.6441\]

This can be conceptualised as the correlation between the enrolments of a selected school at two randomly drawn year (i.e. two randomly selected cohort from the same school). In other words, 64.41% of the total variability is attributable to the differences in enrolments within schools at different time periods.

Unconditional growth model

summary(model1.0)
##  Groups         Name        Variance  Std.Dev. Corr  
##  qcaa_school_id (Intercept) 0.9608954 0.980253       
##                 year15      0.0069467 0.083347 -0.534
##  Residual                   0.1964707 0.443250
##              Estimate Std. Error  t value
## (Intercept) 2.5658807 0.09690281 26.47891
## year15      0.1791002 0.01069684 16.74328
##  Number of Level Two groups =  120 
##  Number of Level Three groups =  NA

The next step involves incorporating the linear growth of time into the model. The model output is shown above.

  • \(\pi_{0ij}\) = 2.5659: Initial status for school \(i\) (i.e. expected log enrolments when time = 0)
  • \(\pi_{1ij}\) = 0.0353: Growth rate for school \(i\)
  • \(\epsilon_{tij}\) = 0.2458: Variance in within-school residuals after accounting for linear growth overtime

When the subject was first introduced in 2015, schools were expected to have an average of 13.0124 (\(e^{2.56588}\)) enrolments. On average, the enrolments were expected to increase by 19.614% (\((e^{0.0353} - 1) \times 100\)) per year. The estimated within-school variance decreased by 48.75% (0.3834 to 0.1965), indicating the 48.75% can be explained by the linear growth in time.

Testing fixed effects

Table 29: AIC for all possible models with different combinations of fixed effects
model npar AIC BIC logLik
model4.5 12 1869.409 1929.989 -922.7043
model4.1 14 1871.986 1942.663 -921.9928
model4.4 11 1874.380 1929.912 -926.1898
model4.0 16 1874.525 1955.299 -921.2624
model4.7 13 1876.474 1942.103 -925.2370
model4.3 10 1881.588 1932.072 -930.7942
model4.6 12 1884.298 1944.879 -930.1491
model4.9 11 1888.825 1944.357 -933.4124
model4.10 11 1888.825 1944.357 -933.4124
model4.2 11 1888.825 1944.357 -933.4124
model4.8 11 1888.825 1944.357 -933.4124

As summarised in step 6, level-two predictors sector and unit will be added to the model. The largest possible model (model4.0) will first be fitted, before iteratively removing fixed effects one at a time (with model4.10 being the smallest of all 10 candidate models), whilst recording the AIC for each model. model4.5 (Table 29) appears to have the optimal (smallest) AIC, and will be used in the next section in building the final model.

Parametric bootstrap to test random effects

Table 30: Parametric Bootstrap to compare larger and smaller, nested model
npar AIC BIC logLik deviance Chisq Df Pr_boot(>Chisq)
10 1907.775 1958.259 -943.8874 1887.775 NA NA NA
12 1869.409 1929.989 -922.7043 1845.409 42.36629 2 0
Histogram of likelihood ratio test statistic, with a red vertical line indicating the likelihood ratio test statistic for the actual model

Figure 50: Histogram of likelihood ratio test statistic, with a red vertical line indicating the likelihood ratio test statistic for the actual model

The parametric bootstrap is used to approximate the likelihood ratio test statistic to produce a more accurate p-value by simulating data under the null hypothesis (detailed explanation can be found in step 7. The p-value indicates the proportion of times in which the bootstrap test statistic is greater than the observed test statistic (as indicated by the red line in Figure 50. There is overwhelming statistical evidence (\(\chi^2\) = 42.3663 and \(p\)-value = 0 – see Table 30) that the larger model (including random slope at level two) is the better model.

Confidence interval

Table 31: 95% confidence intervals for fixed and random effects in the final model
var 2.5 % 97.5 %
sd_(Intercept)|qcaa_school_id 0.7694684 1.0336552
cor_year15.(Intercept)|qcaa_school_id -0.7092237 -0.3083679
sd_year15|qcaa_school_id 0.0557664 0.0927001
sigma 0.4187891 0.4583143
(Intercept) 2.3796251 3.5004024
year15 -0.0076337 0.1346439
sectorGovernment -0.7578332 0.4846176
sectorIndependent -2.4092098 -0.8681366
unityear_12_enrolments -0.3080638 -0.1124896
year15:sectorGovernment 0.0376980 0.1846632
year15:sectorIndependent 0.0957980 0.2969844
year15:unityear_12_enrolments 0.0095620 0.0524932

The parametric bootstrap is utilised to construct confidence intervals (detailed explanation in step 8) for the random effects. If the confidence intervals between the random effects does not include 0, it provides statistical evidence that the p-value is less than 0.5. In other words, it suggests that the random effects and the correlation between the random effects are significant at the 5% level. The confidence interval for the random effects all exclude 0, indicating that they’re different from 0 in the population (i.e. statistically significant).

Composite model

  • Level one (measurement variable) \[Y_{tij} = \pi_{0ij} + \pi_{1ij}year15_{tij} + \epsilon_{tij}\]

  • Level two (schools within postcodes) \[\begin{aligned} \pi_{0ij} &= \beta_{00j} + \beta_{01}sector_{ij} + \beta_{02}unit_{ij} + u_{0ij} \\ \pi_{1ij} &= \beta_{10j} + \beta_{11j}sector_{ij} + \beta_{12j}unit_{ij} + u_{1ij} \end{aligned}\]

Therefore, the composite model can be written as

\[\begin{aligned}Y_{tij} =\ & \pi_{0ij} + \pi_{1ij}year15_{tij} + \epsilon_{tij} \\ =\ & (\beta_{00j} + \beta_{01}sector_{ij} + \beta_{02}unit_{ij} + u_{0ij}) + (\beta_{10j} + \beta_{11j}sector_{ij} + \beta_{12j}unit_{ij} + u_{1ij})year15_{tij} + \epsilon_{tij} \\ =\ & \left[\beta_{00j} + \beta_{01}sector_{ij} + \beta_{02}unit_{ij} + \beta_{10j}year15_{tij} + \beta_{11j}sector_{ij}year15_{tij} + \beta_{12j}unit_{ij}year15_{tij} \right] \left[u_{0ij} + u_{1ij} + \epsilon_{tij} \right] \end{aligned}\]

Fixed effects

summary(model_f)
##  Groups         Name        Variance Std.Dev. Corr  
##  qcaa_school_id (Intercept) 0.802676 0.895922       
##                 year15      0.005632 0.075047 -0.537
##  Residual                   0.192796 0.439085
##                                  Estimate Std. Error    t value
## (Intercept)                    2.96690311 0.28548780 10.3923990
## year15                         0.05859201 0.03459941  1.6934394
## sectorGovernment              -0.14916989 0.30209047 -0.4937921
## sectorIndependent             -1.64252982 0.39342178 -4.1749844
## unityear_12_enrolments        -0.20422624 0.05216529 -3.9149833
## year15:sectorGovernment        0.11332298 0.03601076  3.1469199
## year15:sectorIndependent       0.19500364 0.04787424  4.0732474
## year15:unityear_12_enrolments  0.03002428 0.01136349  2.6421712
##  Number of Level Two groups =  120 
##  Number of Level Three groups =  NA

Based on the model output, the estimated mean enrolments for government schools are estimated to be 13.8577% (\((e^{-0.1491699} - 1) \times 100\)) less than that of catholic schools when the subject was first introduced in 2015. Government schools are also expected to have an average increase of 18.7577% (\((e^{0.0585920 + 0.1133230} - 1) \times 100\)), 11.9994% \(((e^{0.1133230} - 1) \times 100)\) more than that of catholic schools per year, .

On the other hand, independent schools are estimated to have an average enrolments of 80.651% (\((e^{-1.6425298} - 1) \times 100\)) less than that of catholic schools. However, this small initial status is matched with a 28.8651% \((e^{.0585920 + 0.1950036} - 1) \times 100\) increase in enrolments per year, on average. This increase is 21.5315% (\((e^{0.1950036} - 1) \times 100\)) greater than that of catholic schools.

Fixed effects of the final model for Aquatic Practices subject

Figure 51: Fixed effects of the final model for Aquatic Practices subject

The fixed effects can be visualised in Figure 51. As mentioned, catholic schools had the highest enrolments score in 2015, but had a slow increase in enrolments over the years; By 2018, government schools had greater enrolments, on average than catholic schools, and it is expected that independent schools have larger enrolments numbers, on average, than catholic schools. In all sectors, unit 12 appears to increase at a higher rate than that of unit 11, as shown by the steeper slope.

Random effects

In the random effects, there is a some negative correlation between the random intercept and slope, where schools with lower enrolments when subject was first introduced are matched with a higher increase (decrease) in enrolments over the years. However, this negative relationship was not relatively strong, where the correlation between the random intercept and slope is only at -0.54, as shown in the model output.

Predictions

Model predictions for year 11 enrolments for 20 randomly selected schools

Figure 52: Model predictions for year 11 enrolments for 20 randomly selected schools

Figure 52 above shows the predictions for 20 randomly selected schools.

Biology

Exploring the dataset with basic linear model for each school

Basic linear model for 20 randomly selected schools to provide an at-a-glance visualisation of enrolment trends within schools for Biology subject

Figure 53: Basic linear model for 20 randomly selected schools to provide an at-a-glance visualisation of enrolment trends within schools for Biology subject

As described in step 1, a basic linear model was plotted for each school to provide insights of the enrolment trends for each school. Figure 53 demonstrates Biology may be introduced in schools in later years (e.g. school 235) and schools may have removed the subject and in some cases, school 1 and school 645 did not exist after 1997. School 570 showed a halt in the subject from 2005 to 2014. Schools can also vary greatly in enrolment sizes, for instance, schools 129, 132 and 570 had less than 50 enrolments for every cohort while some schools (e.g. school 235) have relatively larger cohort size. Some schools also showed a general decrease in enrolments across the years while some schools such at school 5 and 470 showed a significant increase in enrolments over the years.

Getting the data ready for modelling

Removing graduating cohort 2019

All zero enrolments in a given year will be removed for modelling. As aforementioned, most of the zero enrolments in year 11 (refer to Figure 9) were attributed to the 2007 prep year cohort while zero enrolments in year 12 relates to the first year in which a school introduces the subject. Other zero enrolments mostly relates to smaller schools with little to no enrolments in the subject for a given year.

Linearise relationship using log transformation

Effects of log transformation for response variable (enrolments) in Biology subject

Figure 54: Effects of log transformation for response variable (enrolments) in Biology subject

A log transformation was used on the response variable (enrolments) to allow model to be estimated by the multilevel model, which assumes normality in the error terms. As shown in Figure 54.

Unconditional means model

Table 32: AIC values for all candidate models for Biology
df AIC
Model0.2: Schools nested within districts 4 30247.41
Model0.1: Schools nested within postcodes 4 30284.86
Model0.0: Within schools 3 30290.46

Referring back to step 3, three candidate models are fitted, with the AIC shown in Table 32. Model0.2, corresponding to having schools nested within districts is the best model, with optimised (lowest) AIC and will be used in the subsequent analysis.

Intraclass correlation (\(ICC\))

summary(model0.2)
## Random effects:
##  Groups                       Name        Variance Std.Dev.
##  qcaa_school_id:qcaa_district (Intercept) 0.61534  0.78444 
##  qcaa_district                (Intercept) 0.11133  0.33366 
##  Residual                                 0.22037  0.46943
## 
##  Fixed effects:
##             Estimate Std. Error  t value
## (Intercept) 3.271578 0.09968013 32.82077
## 
##  Number of schools (level-two group) = 468 
##  Number of district (level-three group) = 13

In a three-level multilevel model, two intraclass correlations can be obtained using the model summary output above:

The level-two ICC relates to the correlation between school \(i\) from a certain district \(k\) in time \(t\) and in time \(t^*\):

\[\text{Level-two ICC} = \frac{\tau^{2}_{00}}{\tau^{2}_{00} + \phi^{2}_{00} + \sigma^2} = \frac{0.6153}{(0.6153 + 0.1113 + 0.2204)} = 0.6497\]

This can be conceptualised as the correlation between enrolments of two random draws from the same school at two different years. In other words, 64.97% of the total variability is attributable to the differences between schools from the same district rather than changes over time within schools.

The level-three ICC refers to the correlation between different schools \(i\) and \(i^*\) from a specific school \(j\).

\[\text{Level-three ICC} = \frac{\phi^2_{00}}{\tau^2_{00} + \phi^2_{00} + \sigma^2} = \frac{0.1113}{(0.6153 + 0.1113 + 0.1113)} = 0.1175\]

Similarly, it can be inferred that the correlation between enrolments of two randomly selected schools from different districts are 11.75%, where the total variability can be attributed to the difference between districts.

Unconditional growth model

summary(model1.0)
##  Groups                       Name        Variance   Std.Dev.  Corr  
##  qcaa_district:qcaa_school_id (Intercept) 2.1466e+00 1.4651162       
##                               year92      2.3778e-03 0.0487625 -0.805
##  qcaa_district                (Intercept) 6.5640e-02 0.2562035       
##                               year92      6.0662e-05 0.0077886 0.409 
##  Residual                                 1.5031e-01 0.3876979
##               Estimate  Std. Error   t value
## (Intercept) 2.68260254 0.101014731 26.556548
## year92      0.02593676 0.003253587  7.971744
##  Number of Level Two groups =  468 
##  Number of Level Three groups =  13

The unconditional growth model adds the systematic changes over time, the model specification can be found in step 4. This allows for assessing within-school variability which can be attributed to the linear changes over time. Based on the model output:

  • \(\pi_{0ij}\) = 2.6826: Initial status for school \(i\) in district \(j\) (i.e. expected log enrolments when time = 0)
  • \(\pi_{1ij}\) = 0.0259: Growth rate for school \(i\) in district \(j\)
  • \(\epsilon_{tij}\) = 0.1503: Variance in within-school residuals after accounting for linear growth overtime

Biology was first introduced in 1992, and schools are expected to have 14.62 (\(e^{2.68260}\)), on average. Furthermore, enrolments were expected to increase by 2.628% (\(e^{0.02594} - 1) \times 100\)) every year. The estimated within-school variance decrease by 24.49% (0.2204 to 0.1503), implying that 24.49% of the within-school variability can be explained by the linear growth over time.

Testing fixed effects

Table 33: AIC for all possible models with different combinations of fixed effects
model npar AIC BIC logLik
model4.1 17 23783.03 23918.34 -11874.51
model4.0 19 23784.32 23935.55 -11873.16
model4.7 16 23784.71 23912.06 -11876.35
model4.5 15 23789.67 23909.07 -11879.84
model4.4 14 23791.07 23902.50 -11881.53
model4.9 14 23853.67 23965.11 -11912.84
model4.10 14 23853.67 23965.11 -11912.84
model4.2 14 23853.67 23965.11 -11912.84
model4.8 14 23853.67 23965.11 -11912.84
model4.3 13 23858.62 23962.10 -11916.31
model4.6 15 23859.14 23978.53 -11914.57

As detailed in step 6, level-two predictors (sector and unit) are added to the model. The largest possible model will be fitted, before removing each fixed effect one by one whilst recording the AIC for each model. model4.0 corresponds to the largest model while model4.10 is the smallest possible model. The model with the optimal (lowest) AIC is model4.1, and will be used in subsequent sections.

Parametric bootstrap to test random effects

Table 34: Parametric Bootstrap to compare larger and smaller, nested model
npar AIC BIC logLik deviance Chisq Df Pr_boot(>Chisq)
15 23781.72 23901.12 -11875.86 23751.72 NA NA NA
17 23783.03 23918.35 -11874.51 23749.03 2.694776 2 0.126
Histogram of likelihood ratio test statistic, with a red vertical line indicating the likelihood ratio test statistic for the actual model

Figure 55: Histogram of likelihood ratio test statistic, with a red vertical line indicating the likelihood ratio test statistic for the actual model

The parametric bootstrap is used to approximate the likelihood ratio test statistic to produce a more accurate p-value by simulating data under the null hypothesis (detailed explanation can be found in step 7. Figure 55 displays the likelihood ratio test statistic from the null distribution, with the red line representing the likelihood ratio test statistic using the actual data. The p-value of 14.5% (Table 34) indicates the proportion of times in which the bootstrap test statistic is greater than the observed test statistic. The large estimated \(p\)-value is 0.145 < 0.05 fails to reject the null hypothesis at the 5% level, indicating that the smaller model (without random slope at level three) is preferred.

Confidence interval

Table 35: 95% confidence intervals for fixed and random effects in the final model
var 2.5 % 97.5 %
sd_(Intercept)|school_postcode:qcaa_school_id 0.8530526 1.2182627
cor_year94.(Intercept)|school_postcode:qcaa_school_id -0.9690616 -0.8565112
sd_year94|school_postcode:qcaa_school_id 0.0315056 0.0460868
sd_(Intercept)|school_postcode 0.0000009 0.4581153
sigma 0.4876907 0.5140716
(Intercept) 1.4227924 2.2504155
year94 0.0005497 0.0184106
sectorGovernment -0.3117705 0.4153108
sectorIndependent -0.0628937 0.7950850
unityear_12_enrolments -0.2747165 -0.1289871
year94:unityear_12_enrolments 0.0026651 0.0107254

The parametric bootstrap is utilised to construct confidence intervals (as detailed in step 8). If the confidence intervals for the random effects does not include 0, it provides statistical evidence that the p-value is less than 0.5. In other words, it suggests that the random effects and the correlation between the random effects are significant at the 5% level.

The 95% confidence interval is shown above (Table 35), and the random effects all exclude 0, further reiterating that they are statistically significant at the 5% level. Some fixed effects such as unityear_12_enrolments were insignificant, suggesting that there were no differences between unit 11 and unit 12 units.

Interpreting the final model

Composite model

  • Level one (measurement variable) \[Y_{tij} = \pi_{0ij} + \pi_{1ij}year92_{tij} + \epsilon_{tij}\]

  • Level two (schools within districts) \[\begin{aligned} \pi_{0ij} &= \beta_{00j} + \beta_{01j}sector_{ij} + \beta_{02j}unit_{ij} + \beta_{03j}sector_{ij}unit_{ij} + u_{0ij} \\ \pi_{1ij} &= \beta_{10j} + \beta_{11j}sector_{ij} + u_{1ij} \end{aligned}\]

  • Level three (districts) \[\begin{aligned} \beta_{00j} &= \gamma_{000} + r_{00j} \\ \beta_{01j} &= \gamma_{010} + r_{01j} \\ \beta_{02j} &= \gamma_{020} + r_{02j} \\ \beta_{03j} &= \gamma_{030} + r_{03j} \\ \beta_{10j} &= \gamma_{100} \\ \beta_{11j} &= \gamma_{110} \end{aligned}\]

Therefore, the composite model can be written as

\[\begin{aligned} Y_{tij} =\ &\pi_{0ij} + \pi_{1ij}year92_{tij} + \epsilon_{tij} \\ =\ &(\beta_{00j} + \beta_{01j}sector_{ij} + \beta_{02j}unit_{ij} + \beta_{03j}sector_{ij}unit_{ij} + u_{0ij}) + (\beta_{10j} + \beta_{11j}sector_{ij} + u_{1ij})year92_{tij} + \epsilon_{tij} \\ =\ & \left[(\gamma_{000} + r_{00j}) + (\gamma_{010} + r_{01j})sector_{ij} + (\gamma_{020} + r_{02j})unit_{ij} + (\gamma_{030} + r_{03j})sector_{ij}unit_{ij} + u_{0ij} \right] + \\ & \left[\gamma_{100} + \gamma_{110}sector_{ij} + u_{1ij} \right]year92_{tij} + \epsilon_{tij} \\ =\ & \left[\gamma_{000} + \gamma_{010}sector_{ij} + \gamma_{020}unit_{ij} + \gamma_{020}sector_{ij}unit_{ij} + \gamma_{100}year92_{tij} + \gamma_{110}year92_{tij}sector_{ij} \right] + \\& \left[ r_{00j} + r_{01j}sector_{ij} + r_{02j}unit_{ij} + r_{03j}sector_{ij}unit_{ij} + u_{0ij} + u_{1ij} \epsilon_{tij}\right] \end{aligned}\]

Fixed effects

summary(model_f)
##  Groups                       Name        Variance  Std.Dev. Corr  
##  qcaa_district:qcaa_school_id (Intercept) 1.8582181 1.363165       
##                               year92      0.0020144 0.044882 -0.777
##  qcaa_district                (Intercept) 0.1468611 0.383225       
##  Residual                                 0.1493206 0.386420
##                                              Estimate   Std. Error    t value
## (Intercept)                               2.967847727 0.1906944799 15.5633646
## year92                                    0.026446106 0.0053039215  4.9861420
## sectorGovernment                          0.170207789 0.1829943554  0.9301259
## sectorIndependent                        -1.266171057 0.2020482537 -6.2666766
## unityear_12_enrolments                   -0.003237843 0.0157576995 -0.2054769
## year92:sectorGovernment                  -0.016269000 0.0061142842 -2.6608184
## year92:sectorIndependent                  0.031103798 0.0067695649  4.5946524
## year92:unityear_12_enrolments            -0.001164564 0.0006060924 -1.9214299
## sectorGovernment:unityear_12_enrolments  -0.046618866 0.0143680795 -3.2446136
## sectorIndependent:unityear_12_enrolments -0.039104708 0.0162432846 -2.4074384
##  Number of Level Two groups =  468 
##  Number of Level Three groups =  13

Using the model output above (see step 9 for detailed explanation on fixed effects), the estimated increase in mean enrolments for government schools is 1.0229% (\(e^{0.0264 - 0.0162} - 1) \times 100\)), which is 1.6402% (\((e^{0.01626900} - 1) \times 100\)) less than that of catholic schools. On the other hand, the mean enrolments for independent schools are estimated to increase by 5.9238% (\(e^{0.0264 - 0.0162} - 1) \times 100\)) each year, which is 3.1592% more than catholic schools.

Fixed effects of the final model for Biology subject

Figure 56: Fixed effects of the final model for Biology subject

The fixed effects can be better visualised in Figure 56, independent schools appears to have the highest average increase in enrolments per year. It appears that after 2022, enrolments in independent schools are predicted to be higher than government schools, on average. Unit 11 enrolments only appears to be marginally smaller than unit 12 enrolments for government and independent schools, but appears to be the same for catholic schools. Government schools starts of (in 1992) with the highest enrolments on average, but it is matched with a slow increase in enrolments over the years.

Random effects

Random effects for schools

Figure 57: Random effects for schools

Figure 57 displays the random effects for a given school. It is apparent that the random intercepts and slopes are negatively correlated, where a large intercept is associated with a smaller random slope. This indicates that a larger school is associated with a smaller increase (decrease) in enrolments over the years while smaller schools are predicted to have larger increase in enrolments over the years.

Random intercept for districts

Figure 58: Random intercept for districts

As the random slopes are removed, all districts are predicted to have the same increase in enrolments over the years; And as was discussed previously, this was a reasonable assumption or an otherwise perfect correlation with random slope and intercept will be fitted. Figure 58 demonstrates that schools in Brisbane Central has the largest enrolments while Mackay have the lowest enrolments in Biology subject, on average.

Predictions

Model predictions for 20 randomly selected schools

Figure 59: Model predictions for 20 randomly selected schools

Figure 59 above shows the predictions for 20 randomly selected schools.

Chemistry

Exploring the dataset with basic linear model

Basic linear model for 20 randomly selected schools to provide an at-a-glance visualisation of enrolment trends within schools for Chemistry subject

Figure 60: Basic linear model for 20 randomly selected schools to provide an at-a-glance visualisation of enrolment trends within schools for Chemistry subject

With reference to the first step, Figure 60 fits a linear model for enrolments for a random sample of 20 schools. The various school sizes is apparent; To illustrate, school 631 and 647 had enrolments of less than 20 each year, while schools such as schools 90 and 555 consistently showed enrolments of approximately more than 50 students each year. Some schools discontinued the subjects (e.g. school 96) while other schools only introduced the subject in the later years (e.g. school 315).

Getting the data ready for modelling

Removing zero enrolments

All zero enrolments in a given year will be removed for modelling. As aforementioned, most of the zero enrolments in year 11 (refer to Figure 9) were attributed to the 2007 prep year cohort while zero enrolments in year 12 relates to the first year in which a school introduces the subject. Other zero enrolments mostly relates to smaller schools with little to no enrolments in the subject for a given year. These zero enrolments will be removed for modelling purposes.

Linearise response variable using log transformation

Effects of log transformation for response variable (enrolments) in Chemistry subject

Figure 61: Effects of log transformation for response variable (enrolments) in Chemistry subject

As multilevel model assumes normality in the error terms, a log transformation is utilised to allow models to be estimated by the linear mixed models. The log transformation allows enrolment numbers to be approximately normally distributed (Figure 61).

Unconditional means model

Table 36: AIC values for all candidate models for Chemistry
df AIC
Model0.2: Schools nested within districts 4 28379.04
Model0.1: Schools nested within postcodes 4 28401.11
Model0.0: Within schools 3 28422.33

As per step 3, the three potential models are fitted, with the AIC shown in Table 36. Based on the AIC, model0.2, corresponding the schools nested within districts is the best model and will be used in the subsequent analysis.

Intraclass correlation (\(ICC\))

summary(model0.2)
## Random effects:
##  Groups                       Name        Variance Std.Dev.
##  qcaa_school_id:qcaa_district (Intercept) 0.58885  0.76736 
##  qcaa_district                (Intercept) 0.10767  0.32813 
##  Residual                                 0.20940  0.45760
## 
##  Fixed effects:
##             Estimate Std. Error t value
## (Intercept) 2.883525 0.09814118 29.3814
## 
##  Number of schools (level-two group) = 454 
##  Number of district (level-three group) = 13

This model takes into account 454 schools nested in 13 districts. In a three-level multilevel model, two intraclass correlations can be obtained using the model summary output above:

The level-two ICC relates to the correlation between school \(i\) from district \(k\) in time \(t\) and in time \(t^* \ne t\):

\[\text{Level-two ICC} = \frac{\tau^{2}_{00}}{\tau^{2}_{00} + \phi^{2}_{00} + \sigma^2} = \frac{0.5888}{(0.5888 + 0.1077 + 0.2094)} = 0.6450\]

This can be conceptualised as the correlation between enrolments of two random draws from the same school at two different years. In other words, 64.50% of the total variability is attributable to the changes overtime within schools.

The level-three ICC refers to the correlation between different schools \(i\) and \(i^*\) from a specific district \(j\) in time \(t\) and time \(t^* \ne t\).

\[\text{Level-three ICC} = \frac{\phi^2_{00}}{\tau^2_{00} + \phi^2_{00} + \sigma^2} = \frac{0.1077}{(0.5888 + 0.1077 + 0.2094)} = 0.1189\] Similarly, this can be conceptualised as the correlation between enrolments of two randomly selected schools from the same district – i.e. 11.70% of the total variability is due to the difference between districts.

Unconditional growth model

The unconditional growth model introduces the time predictor at level one, the model specification can be found in step 4. This allows for assessing within-school variability which can be attributed to linear changes over time. Furthermore, variability in intercepts and slopes can be obtained to compare schools within the same districts, and schools from different districts.

summary(model1.0)
##  Groups                       Name        Variance   Std.Dev.  Corr  
##  qcaa_district:qcaa_school_id (Intercept) 1.7739e+00 1.3318683       
##                               year92      1.8155e-03 0.0426081 -0.790
##  qcaa_district                (Intercept) 6.1621e-02 0.2482367       
##                               year92      6.1531e-05 0.0078442 0.512 
##  Residual                                 1.4798e-01 0.3846785
##               Estimate  Std. Error   t value
## (Intercept) 2.40033451 0.095359577 25.171405
## year92      0.02165636 0.003068893  7.056736
##  Number of Level Two groups =  454 
##  Number of Level Three groups =  13
  • \(\pi_{0ij}\) = 2.4003: Initial status for school \(i\) in district \(j\) (i.e. expected log enrolments when time = 0)
  • \(\pi_{1ij}\) = 0.0217: Growth rate for school \(i\) in district \(j\)
  • \(\epsilon_{tij}\) = 0.1480: Variance in within-school residuals after accounting for linear growth overtime

When the subject was first introduced in 1992, schools were expected to have 11.0265 (\(e^{1.7848}\)) enrolments, on average. Furthermore, on average, enrolments were expected to increase by 2.1893% (\((e^{0.0216564} - 1) \times 100\)) per year. The estimated within-schools variance decreased by 29.3326% (0.2094 to 0.14797754), implying that 29.3326% of within-school variability can be explained by the linear growth over time.

Testing fixed effects

Table 37: AIC for all possible models with different combinations of fixed effects
model AIC
model4.7 22651.34
model4.1 22651.36
model4.4 22653.66
model4.5 22654.01
model4.0 22656.29
model4.2 22718.07
model4.8 22718.07
model4.9 22718.07
model4.10 22718.07
model4.6 22718.07
model4.3 22720.78

As highlighted in step 6, sector and unit will be added as predictors to the model. The largest possible model will be fitted, before removing fixed effects one by one while recording the AIC for each model. In this case, model4.0 corresponds to the largest possible model while model4.10 is the smallest possible model. The model with the optimal (lowest) AIC is model4.4 (Table 37). The next section will test the selected model’s random effects to build the final model.

Parametric bootstrap to test random effects

Table 38: Parametric Bootstrap to compare larger and smaller, nested model
npar AIC BIC logLik deviance Chisq Df Pr_boot(>Chisq)
14 22651.66 22762.71 -11311.83 22623.66 NA NA NA
16 22651.34 22778.26 -11309.67 22619.34 4.311858 2 0.04
Histogram of likelihood ratio test statistic, with a red vertical line indicating the likelihood ratio test statistic for the actual model

Figure 62: Histogram of likelihood ratio test statistic, with a red vertical line indicating the likelihood ratio test statistic for the actual model

The parametric bootstrap is used to approximate the likelihood ratio test statistic to produce a more accurate p-value by simulating data under the null hypothesis (detailed explanation can be found in step 7. Figure 62 displays the likelihood ratio test statistic from the null distribution, with the red line representing the likelihood ratio test statistic using the actual data. The p-value of 0.04% (Table 38) indicates the proportion of times in which the bootstrap test statistic is greater than the observed test statistic. The estimated \(p\)-value is 0.04 < 0.05 fails to reject the null hypothesis at the 5% level, indicating that the larger model (without random slope at level three) is preferred.

Confidence interval

Table 39: 95% confidence intervals for fixed and random effects in the final model
var 2.5 % 97.5 %
sd_(Intercept)|qcaa_district:qcaa_school_id 1.1802941 1.3535015
cor_year92.(Intercept)|qcaa_district:qcaa_school_id -0.8005691 -0.7167320
sd_year92|qcaa_district:qcaa_school_id 0.0362919 0.0424250
sd_(Intercept)|qcaa_district 0.0854839 0.4588320
cor_year92.(Intercept)|qcaa_district -0.6818153 1.0000000
sd_year92|qcaa_district 0.0006963 0.0113808
sigma 0.3792006 0.3871578
(Intercept) 2.3667074 3.0418792
year92 0.0071077 0.0269531
sectorGovernment -0.2661168 0.3889485
sectorIndependent -1.4463471 -0.6984020
unityear_12_enrolments -0.0646203 -0.0173266
year92:sectorGovernment -0.0209668 0.0013685
year92:sectorIndependent 0.0189922 0.0427861
sectorGovernment:unityear_12_enrolments -0.0624251 -0.0069597
sectorIndependent:unityear_12_enrolments -0.0527715 0.0100726

The parametric bootstrap is utilised to construct confidence intervals (as detailed in step 8). If the confidence intervals for the random effects does not include 0, it provides statistical evidence that the p-value is less than 0.5. In other words, it suggests that the random effects and the correlation between the random effects are significant at the 5% level.

The 95% confidence interval is shown above (Table 39), and the random effects all exclude 0, further reiterating that they are statistically significant at the 5% level.

Interpreting the final model

Composite model

  • Level one (measurement variable) \[Y_{tij} = \pi_{0ij} + \pi_{1ij}year92_{tij} + \epsilon_{tij}\]

  • Level two (schools within districts) \[\begin{aligned} \pi_{0ij} &= \beta_{00j} + \beta_{01j}sector_{ij} + \beta_{02j}unit_{ij} + \beta_{03j}sector_{ij}unit_{ij} + u_{0ij} \\ \pi_{1ij} &= \beta_{10j} + \beta_{11j}sector_{ij} + u_{1ij} \end{aligned}\]

  • Level three (districts) \[\begin{aligned} \beta_{00j} = \gamma_{000} + r_{00j} \\ \beta_{01j} = \gamma_{010} + r_{01j} \\ \beta_{02j} = \gamma_{020} + r_{02j} \\ \beta_{03j} = \gamma_{030} + r_{03j} \\ \beta_{10j} = \gamma_{100} + r_{10j} \\ \beta_{11j} = \gamma_{110} + r_{11j} \end{aligned}\]

Therefore, the composite model can be written as

\[\begin{aligned} Y_{tij} =\ &\pi_{0ij} + \pi_{1ij}year92_{tij} + \epsilon_{tij} \\ =\ &(\beta_{00j} + \beta_{01j}sector_{ij} + \beta_{02j}unit_{ij} + \beta_{03j}sector_{ij}unit_{ij} + u_{0ij}) + (\beta_{10j} + \beta_{11j}sector_{ij} + u_{1ij})year92_{tij} + \epsilon_{tij} \\ =\ & \left[(\gamma_{000} + r_{00j}) + (\gamma_{010} + r_{01j})sector_{ij} + (\gamma_{020} + r_{02j})unit_{ij} + (\gamma_{030} + r_{03j})sector_{ij}unit_{ij} + u_{0ij} \right] + \\ & \left[(\gamma_{100} + r_{10j}) + (\gamma_{110} + r_{11j})sector_{ij} + u_{1ij} \right]year92_{tij} + \epsilon_{tij} \\ =\ & \left[\gamma_{000} + \gamma_{010}sector_{ij} + \gamma_{020}unit_{ij} + \gamma_{020}sector_{ij}unit_{ij} + \gamma_{100}year92_{tij} + \gamma_{110}year92_{tij}sector_{ij} \right] + \\ & \left[ r_{00j} + r_{01j}sector_{ij} + r_{02j}unit_{ij} + r_{03j}sector_{ij}unit_{ij} + u_{0ij} + r_{10j}year92_{tij} + r_{11j}year92_{tij}sector_{ij} + u_{1ij} \epsilon_{tij} \right] \end{aligned}\]

Fixed effects

summary(model_f)
##  Groups                       Name        Variance   Std.Dev.  Corr  
##  qcaa_district:qcaa_school_id (Intercept) 1.6097e+00 1.2687521       
##                               year92      1.5320e-03 0.0391402 -0.765
##  qcaa_district                (Intercept) 7.7313e-02 0.2780516       
##                               year92      3.3268e-05 0.0057678 0.621 
##  Residual                                 1.4672e-01 0.3830370
##                                             Estimate  Std. Error    t value
## (Intercept)                               2.69358309 0.165949162 16.2313751
## year92                                    0.01832166 0.004966819  3.6888110
## sectorGovernment                          0.06422409 0.170475913  0.3767341
## sectorIndependent                        -1.04537987 0.188584541 -5.5432957
## unityear_12_enrolments                   -0.04196115 0.012373547 -3.3911980
## year92:sectorGovernment                  -0.01003111 0.005433864 -1.8460368
## year92:sectorIndependent                  0.03050230 0.006036915  5.0526299
## sectorGovernment:unityear_12_enrolments  -0.03559772 0.014308159 -2.4879313
## sectorIndependent:unityear_12_enrolments -0.02235697 0.016211174 -1.3791088
##  Number of Level Two groups =  454 
##  Number of Level Three groups =  13

Using the model output above (see step 9 for detailed explanation on fixed effects), the estimated increase in mean enrolments for government schools is 0.832496% (\((e^{0.0183 - 0.0100} - 1) \times 100\)), which is 1.0082% (\((e^{0.0100} - 1) \times 100\)) less than that of catholic schools. On the other hand, the mean enrolments for independent schools are estimated to increase by 5.0036% (\(e^{0.01832 + 0.03050} - 1) \times 100\)) each year, which is 3.0972% (\((e^{0.0305022}-1) \times 100\)) more than catholic schools.

Fixed effects of the final model for Chemistry

Figure 63: Fixed effects of the final model for Chemistry

The fixed effects are better accentuated in Figure 63. As aforementioned, independent school have the lowest initial status, that is, the least enrolments when the subject was first introduced. However, this low enrolments was matched with a relatively high slope compared to the other sectors. From 2021 onwards, independent schools are expected to have greater enrolments than government schools, on average. Government schools have the highest initial status, but are expected to have the smallest increase in enrolments over the years (as demonstrated by the gentle slope).

Random effects

Random effects for schools

Figure 64: Random effects for schools

Figure 64 displays the random effects for a given school. It is apparent that the random intercepts and slopes are negatively correlated, where a large intercept is associated with a smaller random slope (csorrelation = -0.76, as shown in the model output). This indicates that a larger school is associated with a smaller increase (decrease) in enrolments over the years while smaller schools are predicted to have larger increase in enrolments over the years.

Random intercept for districts

Figure 65: Random intercept for districts

The model includes a random slope at level three, and the effects of the random intercept and slope in level three is shown in Figure 65. As shown in the model output, correlation between the random intercept and slope is 0.62. Loosely speaking, this suggests that districts with large enrolments are going to be larger, while smaller districts are going to increase (decrease) at a slower rate. Based on Figure 65, Gold Coast are estimated to have the highest slope, indicating that the rate of change in enrolment is the greatest compared to the other districts.

Predictions

Model predictions for 20 randomly selected schools

Figure 66: Model predictions for 20 randomly selected schools

Figure 66 above shows the predictions for 20 randomly selected schools.

Earth and Environmental Science

Exploring the dataset with basic linear model

Basic linear model for 20 randomly selected schools to provide an at-a-glance visualisation of enrolment trends within schools for Earth and Environmental Science subject

Figure 67: Basic linear model for 20 randomly selected schools to provide an at-a-glance visualisation of enrolment trends within schools for Earth and Environmental Science subject

Figure 67 fits a linear model for 20 randomly selected schools. The difference in enrolment numbers across the cohorts between schools is apparent, for example, school 405 and 532 which consistently showed little enrolments (< 20) per year while schools such as school 175 and 618 demonstrates significantly higher enrolment numbers in a single cohort.

Some schools offered the subject in little years, such as school 233 and 588, which only offered the subject in the new QCE system. There were also varying patterns in enrolment trends where school 618 showed a large increase in enrolments over the years while school 111 appears to have a stark decrease in enrolments since it offered the subject. In some cases, there were only a few students enrolments in the school, such as school 532, which only showed 1 enrolment when the school offered the subject.

Getting the data ready for modelling

Removing zero enrolments

All zero enrolments in a given year will be removed for modelling. As aforementioned, most of the zero enrolments in year 11 (refer to Figure 9) were attributed to the 2007 prep year cohort while zero enrolments in year 12 relates to the first year in which a school introduces the subject. Other zero enrolments mostly relates to smaller schools with little to no enrolments in the subject for a given year. These zero enrolments will be removed for modelling purposes.

Linearise response variable using log transformation

Effects of log transformation for response variable (enrolments) in Earth and Environmental Science subject

Figure 68: Effects of log transformation for response variable (enrolments) in Earth and Environmental Science subject

The enrolments were right skewed, which is likely to be attributed to the various school sizes (as seen in Figure 67). A log transformation was implemented to the response variable (i.e. enrolments) to allow the the multilevel model to better capture the enrolment patterns.

Unconditional means model

Table 40: AIC values for all candidate models for Earth and Environmental Science
df AIC
Model0.0: Within schools 3 2243.65
Model0.1: Schools nested within postcodes 4 2244.80
Model0.2: Schools nested within districts 4 2245.65

As outlined in step 3, the three candidate models are fitted and their AIC is shown in Table 40. Based on the AIC, the two-level model (model0.0) is the superior model and will be used in the subsequent analysis.

Intraclass correlation (\(ICC\))

summary(model0.0)
## Random effects:
##  Groups         Name        Variance Std.Dev.
##  qcaa_school_id (Intercept) 1.51734  1.2318  
##  Residual                   0.29507  0.5432
## 
##  Fixed effects:
##             Estimate Std. Error  t value
## (Intercept)  2.06781  0.1325348 15.60202
## 
##  Number of schools (level-two group) = 92 
##  Number of district (level-three group) = NA

This model takes into account 92. For a two-level multilevel model, the level two intraclass correlation coefficient (ICC) can be computed using the model output above. The level-two ICC is the correlation between a school \(i\) in time \(t\) and time \(t^*\):

\[\text{Level-two ICC} = \frac{\tau^{2}_{00}}{\tau^{2}_{00} + \phi^{2}_{00} + \sigma^2} = \frac{1.5173}{(1.5173 + 0.2951)} = 0.8372\] This can be conceptualised as the correlation between the enrolments of a selected school at two randomly drawn year (i.e. two randomly selected cohort from the same school). In other words, 83.72% of the total variability is attributable to the differences in enrolments within schools at different time periods.

Unconditional growth model

summary(model1.0)
##  Groups         Name        Variance  Std.Dev. Corr  
##  qcaa_school_id (Intercept) 2.0601140 1.435310       
##                 year92      0.0028797 0.053663 -0.847
##  Residual                   0.2458310 0.495813
##               Estimate  Std. Error  t value
## (Intercept) 1.63148417 0.176855363 9.224963
## year92      0.03527984 0.007508286 4.698787
##  Number of Level Two groups =  92 
##  Number of Level Three groups =  NA

The next step involves incorporating the linear growth of time into the model. The model output is shown above.

  • \(\pi_{0ij}\) = 1.6314: Initial status for school \(i\) (i.e. expected log enrolments when time = 0)
  • \(\pi_{1ij}\) = 0.0353: Growth rate for school \(i\)
  • \(\epsilon_{tij}\) = 0.2458: Variance in within-school residuals after accounting for linear growth overtime

When the subject was first introduced in 1992, schools were expected to have an average of 5.1115 (\(e^{1.6314842}\)) enrolments, which is a relatively low number as compared to the other mathematics and science subjects. Furthermore, Figure 2 demonstrated that the number of schools offering the subject in 1992 was the highest among all years.

On average, the enrolments were expected to increase by 3.59304% (\((e^{0.0353} - 1) \times 100\)) per year. The estimated within-school variance decreased by 16.70% (0.2951 to 0.2458), indicating the 16.70% can be explained by the linear growth in time.

Testing fixed effects

Table 41: AIC for all possible models with different combinations of fixed effects
model npar AIC BIC logLik
model4.4 11 2083.856 2139.727 -1030.928
model4.5 12 2085.205 2146.155 -1030.603
model4.7 13 2086.236 2152.265 -1030.118
model4.3 10 2086.805 2137.596 -1033.402
model4.1 14 2087.644 2158.753 -1029.822
model4.10 11 2087.848 2143.719 -1032.924
model4.9 11 2087.848 2143.719 -1032.924
model4.2 11 2087.848 2143.719 -1032.924
model4.8 11 2087.848 2143.719 -1032.924
model4.6 12 2089.241 2150.191 -1032.621
model4.0 16 2091.481 2172.748 -1029.740

As summarise in step 6, level-two predictors secotr and unit will be added to the model. The largest possible model (model4.0) will first be fitted, before iteratively removing fixed effects one at a time (with model4.10 being the smallest of all 10 candidate models), whilst recording the AIC for each model. model4.4 appears to have the optimal (smallest) AIC (Table 41), and will be used in the next section in building the final model.

Parametric bootstrap to test random effects

Table 42: Parametric Bootstrap to compare larger and smaller, nested model
npar AIC BIC logLik deviance Chisq Df Pr_boot(>Chisq)
9 2223.393 2269.105 -1102.696 2205.393 NA NA NA
11 2083.856 2139.727 -1030.928 2061.856 143.537 2 0
Histogram of likelihood ratio test statistic, with a red vertical line indicating the likelihood ratio test statistic for the actual model

Figure 69: Histogram of likelihood ratio test statistic, with a red vertical line indicating the likelihood ratio test statistic for the actual model

The parametric bootstrap is used to approximate the likelihood ratio test statistic to produce a more accurate p-value by simulating data under the null hypothesis (detailed explanation can be found in step 7. The p-value indicates the proportion of times in which the bootstrap test statistic is greater than the observed test statistic. Figure 69 displays the likelihood ratio test statistic from the null distribution, with the red line indicates the likelihood ratio test statistic using the actual data.

There is overwhelming statistical evidence (\(\chi^2\) = 143.537 and \(p\)-value = 0 from Table 42) that the larger model (including random slope at level two) is the better model.

Confidence interval

Table 43: 95% confidence intervals for fixed and random effects in the final model
var 2.5 % 97.5 %
sd_(Intercept)|qcaa_school_id 1.0791392 1.5883744
cor_year92.(Intercept)|qcaa_school_id -0.9079310 -0.6751860
sd_year92|qcaa_school_id 0.0393192 0.0678080
sigma 0.4746765 0.5167394
(Intercept) 0.5979313 2.9539799
year92 -0.0293459 0.0774094
sectorGovernment -1.6512267 0.7654638
sectorIndependent -0.4903149 2.2137154
unityear_12_enrolments -0.0447976 0.0733969
year92:sectorGovernment -0.0366734 0.0751267
year92:sectorIndependent -0.0819655 0.0454768

The parametric bootstrap is utilised to construct confidence intervals (detailed explanation in step 8) for the random effects. If the confidence intervals between the random effects does not include 0, it provides statistical evidence that the p-value is less than 0.5. In other words, it suggests that the random effects and the correlation between the random effects are significant at the 5% level. The confidence interval for the random effects all exclude 0 (Table 43), indicating that they’re different from 0 in the population (i.e. statistically significant).

Interpreting final model

Composite model

  • Level one (measurement variable) \[Y_{tij} = \pi_{0ij} + \pi_{1ij}year92_{tij} + \epsilon_{tij}\]

  • Level two (schools within districts) will contain new predictor(sector) \[\begin{aligned} \pi_{0ij} = \beta_{00j} + \beta_{01j}sector_{ij} + \beta_{02j}unit_{ij} + u_{0ij} \\ \pi_{1ij} = \beta_{10j} + \beta_{11j}sector_{ij} + u_{1ij} \end{aligned}\]

The composite model can therefore be written as: \[\begin{aligned} Y_{tij} =\ & \pi_{0ij} + \pi_{1ij}year92_{tij} + \epsilon_{tij} \\ =\ & (\beta_{00j} + \beta_{01j}sector_{ij} + \beta_{02j}unit_{ij} + u_{0ij}) + (\beta_{10j} + \beta_{11j}sector_{ij} + u_{1ij})year92_{tij} + \epsilon_{tij} \\ =\ & \left[\beta_{00j} + \beta_{01j}sector_{ij} + \beta_{02}unit_{ij} + \beta_{10j}year92_{tij} + \beta_{11j}sector_{ij}year92_{tij} \right] + \left[u_{0ij} + u_{1ij} + \epsilon_{tij} \right] \end{aligned}\]

Fixed effects

summary(model_f)
##  Groups         Name        Variance  Std.Dev. Corr  
##  qcaa_school_id (Intercept) 1.8069480 1.344228       
##                 year92      0.0028392 0.053284 -0.824
##  Residual                   0.2457532 0.495735
##                             Estimate Std. Error    t value
## (Intercept)               1.70508906 0.63997319  2.6643133
## year92                    0.02482993 0.02880730  0.8619320
## sectorGovernment         -0.35142132 0.66866804 -0.5255542
## sectorIndependent         0.96758390 0.74465152  1.2993781
## unityear_12_enrolments    0.01447581 0.02919046  0.4959088
## year92:sectorGovernment   0.01972213 0.03014928  0.6541493
## year92:sectorIndependent -0.02215715 0.03284658 -0.6745648
##  Number of Level Two groups =  92 
##  Number of Level Three groups =  NA

Based on the model output, the estimated mean enrolments for government schools are estimated to be 29.63% (\((e^{0.3514213} - 1) \times 100\)) less than that of catholic schools when the subject was first introduced in 1992. However, government schools are estimated to have a mean increase of 4.5560% (\((e^{0.0248299 + 0.0197221} - 1) * \times 100\)) per year, which is 1.9918% (\((e^{0.0197221} - 1) * \times 100\)) greater than the increase in enrolments in catholic schools.

Independent schools are estimated to have a 163.158% (\((e^{0.9675839} - 1) \times 100\)) more than that of catholic schools in 1992. However, independent schools showed a slow increase in enrolments (0.26786%) over per year, on average. This increase in enrolments is -2.1913% (\((exp^{-0.0221572} - 1) * 100\)) less than that of catholic schools.

Fixed effects of the final model for Agricultural Practices subject

Figure 70: Fixed effects of the final model for Agricultural Practices subject

The results can be better visualised in Figure 70. On average Government schools started off with little enrolments, but showed a stark increase in enrolments over the years. In contrast, Independent schools have relatively high enrolments when the subject was first introduced in 1992, but showed little increase over the years.

Random effects

Random effects for all schools

Figure 71: Random effects for all schools

Figure 71 shows the random effects for all 92 schools that offered the subject. There is a clear negative correlation between the random intercept and the random slope, which indicates that in general, schools with lesser enrolments are generally matched with a larger increase in enrolments over the years.

Predictions

Model predictions for year 11 enrolments for 20 randomly selected schools

Figure 72: Model predictions for year 11 enrolments for 20 randomly selected schools

Figure 72 above shows the predictions for 20 randomly selected schools.

Marince Science

Exploring the dataset with basic linear model for each school

Basic linear model for 20 randomly selected schools to provide an at-a-glance visualisation of enrolment trends within schools for Marine Science

Figure 73: Basic linear model for 20 randomly selected schools to provide an at-a-glance visualisation of enrolment trends within schools for Marine Science

Figure 73 shows a basic linear plot for 20 randomly selected schools. The subject was introduced in 2014, and some school offered the subjects at later years while some schools discontinued the subject (e.g. school 373). Next, some schools (e.g. school 220 and 233) showed a large increase in enrolments relative to the other schools while some schools showed a decrease in enrolments over the years. The various school sizes can be seen, where some schools have less than 25 enrolments each year while some school have over 50 enrolments per cohort.

Getting the data ready for modelling

Removing zero enrolments

As aforementioned, most of the zero enrolments in year 11 (refer to Figure 9) were attributed to the 2007 prep year cohort while zero enrolments in year 12 relates to the first year in which a school introduces the subject. Other zero enrolments mostly relates to smaller schools with little to no enrolments in the subject for a given year. For these reasons, all completion years with zero enrolments will also be removed for modelling.

Linearise response variable using log transformation

Effects of log transformation for response variable (enrolments) in Marine Science

Figure 74: Effects of log transformation for response variable (enrolments) in Marine Science

The enrolments were right skewed, which is likely to be attributed to the various school sizes (as seen in Figure 74). A log transformation was implemented to the response variable (i.e. enrolments) to allow the the multilevel model to better capture the enrolment patterns.

Unconditional means model

Table 44: AIC values for all candidate models for Specialist Mathematics
df AIC
Model0.0: Within schools 3 1261.849
Model0.2: Schools nested within districts 4 1262.921
Model0.1: Schools nested within postcodes 4 1263.849

As underlined in Step 3, the three candidate models are fitted and their AIC is shown in Table 44. Based on the AIC, the two-level model (model0.0) is the best model and will be used in the subsequent analysis.

Intraclass correlation (\(ICC\))

## Random effects:
##  Groups         Name        Variance Std.Dev.
##  qcaa_school_id (Intercept) 0.44111  0.66416 
##  Residual                   0.21735  0.46620
## 
##  Fixed effects:
##             Estimate Std. Error  t value
## (Intercept) 2.955045 0.07678718 38.48358
## 
##  Number of schools (level-two group) = 81 
##  Number of district (level-three group) = NA

This model takes into account 81 schools which offer the subject. For a two-level multilevel model, the level two intraclass correlation coefficient (ICC) can be computed using the model output above.

The level-two ICC is the correlation between a school \(i\) in time \(t\) and time \(t^*\):

\[ \text{level-two ICC} = \frac{\tau^{2}_{00}}{\tau^{2}_{00} + \phi^{2}_{00} + \sigma^2} = \frac{0.4411}{(0.4411 + 0.2173)} = 0.6700\]

This can be conceptualised as the correlation between the enrolments of a selected school at two randomly drawn year (i.e. two randomly selected cohort from the same school). In other words, 67% of the total variability is attributable to the differences in enrolments within schools at different time periods.

Unconditional Growth model

##  Groups         Name        Variance  Std.Dev. Corr  
##  qcaa_school_id (Intercept) 0.4715552 0.686699       
##                 year15      0.0071891 0.084788 -0.284
##  Residual                   0.1628130 0.403501
##              Estimate Std. Error  t value
## (Intercept) 2.7409337 0.08744009 31.34642
## year15      0.0464796 0.01279031  3.63397
##  Number of Level Two groups =  81 
##  Number of Level Three groups =  NA

The next step involves incorporating the linear growth of time into the model. The model output is shown above.

  • \(\pi_{0ij}\) = 2.7409: Initial status for school \(i\) (i.e. expected log enrolments when time = 0)
  • \(\pi_{1ij}\) = 0.0465: Growth rate for school \(i\)
  • \(\epsilon_{tij}\) = 0.1628: Variance in within-school residuals after accounting for linear growth overtime

When the subject was first introduced in 2015, schools were expected to have an average of 15.5009 (\(e^{2.7409}\)) enrolments. On average, the enrolments were expected to increase by 4.7598% (\((e^{0.0465} - 1) \times 100\)) per year. The estimated within-school variance decreased by 25.07% (0.2173 to 0.162813), indicating the 25.07% can be explained by the linear growth in time.

Testing fixed effects

Table 45: AIC for all possible models with different combinations of fixed effects
model AIC
model4.4 1127.338
model4.2 1128.333
model4.8 1128.333
model4.10 1128.333
model4.9 1128.333
model4.3 1128.400
model4.7 1128.990
model4.5 1129.327
model4.6 1130.333
model4.1 1130.989
model4.0 1134.839

As summarise in step 6, level-two predictors secotr and unit will be added to the model. The largest possible model (model4.0) will first be fitted, before iteratively removing fixed effects one at a time (with model4.10 being the smallest of all 10 candidate models), whilst recording the AIC for each model. model4.4 (Table 45) appears to have the optimal (smallest) AIC, and will be used in the next section in building the final model.

Parametric bootstrap to test random effects

Table 46: Parametric Bootstrap to compare larger and smaller, nested model
npar AIC BIC logLik deviance Chisq Df Pr_boot(>Chisq)
9 1177.905 1219.838 -579.9523 1159.905 NA NA NA
11 1127.338 1178.590 -552.6690 1105.338 54.56667 2 0

The parametric bootstrap is used to approximate the likelihood ratio test statistic to produce a more accurate p-value by simulating data under the null hypothesis (detailed explanation can be found in step 7. The p-value indicates the proportion of times in which the bootstrap test statistic is greater than the observed test statistic (as indicated by the red line in Figure ??. There is overwhelming statistical evidence (\(\chi^2\) = 54.5667 and \(p\)-value = 0 – see Table 46) that the larger model (including random slope at level two) is the better model.

Confidence interval

Table 47: 95% confidence intervals for fixed and random effects in the final model
var 2.5 % 97.5 %
sd_(Intercept)|qcaa_school_id 0.5516981 0.8199168
cor_year15.(Intercept)|qcaa_school_id -0.6355994 -0.0908638
sd_year15|qcaa_school_id 0.0604489 0.1065415
sigma 0.3809136 0.4236035
(Intercept) 2.4156300 3.2727234
year15 -0.0301446 0.0911383
sectorGovernment -0.4283699 0.5614812
sectorIndependent -0.9822681 0.1326594
unityear_12_enrolments -0.1255390 -0.0104000
year15:sectorGovernment -0.0252647 0.1139009
year15:sectorIndependent -0.0999005 0.0667181

The parametric bootstrap is utilised to construct confidence intervals (detailed explanation in step 8) for the random effects. If the confidence intervals between the random effects does not include 0, it provides statistical evidence that the p-value is less than 0.5. In other words, it suggests that the random effects and the correlation between the random effects are significant at the 5% level. The confidence interval for the random effects all exclude 0 (Table 47), indicating that they’re different from 0 in the population (i.e. statistically significant).

Interpreting final model

Composite model

  • Level one (measurement variable) \[Y_{tij} = \pi_{0ij} + \pi_{1ij}year92_{tij} + \epsilon_{tij}\]

  • Level two (schools within districts) \[\begin{aligned} \pi_{0ij} = \beta_{00j} + \beta_{01j}sector_{ij} + \beta_{02j}unit_{ij} + u_{0ij} \\ \pi_{1ij} = \beta_{10j} + \beta_{11j}sector_{ij} + u_{1ij} \end{aligned}\]

  • Level three (districts) \[\begin{aligned} \beta_{00j} &= \gamma_{000} + r_{00j} \\ \beta_{01j} &= \gamma_{010} + r_{01j} \\ \beta_{02j} &= \gamma_{020} + r_{02j} \\ \beta_{10j} &= \gamma_{100} + r_{10j} \\ \beta_{11j} &= \gamma_{110} + r_{11j} \end{aligned}\]

Therefore, the composite model can be written as

\[\begin{aligned} Y_{tij} =\ &\pi_{0ij} + \pi_{1ij}year92_{tij} + \epsilon_{tij} \\ =\ &(\beta_{00j} + \beta_{01j}sector_{ij} + \beta_{02j}unit_{ij} + u_{0ij}) + (\beta_{10j} + \beta_{11j}sector_{ij} + u_{1ij})year15_{tij} +\epsilon_{tij} \\ =\ & \left[\gamma_{000} + r_{00j} + (\gamma_{010} + r_{01j})sector_{ij} + (\gamma_{020} + r_{02j})unit_{ij} + u_{0ij} \right] + \\ & \left[\gamma_{100} + r_{10j} + (\gamma_{110} + r_{11j})sector_{ij}\right]year15_{tij} + \epsilon_{tij} \\ =\ & \left[\gamma_{000} + \gamma_{010}sector_{ij} + \gamma_{020}unit_{ij} + \gamma_{100}year15_{tij} + \gamma_{110}sector_{ij}year15_{tij} \right] + \\ & \left[r_{00j} + r_{01j}sector_{ij} + r_{02j}unit_{ij} + r_{10j}year15_{tij} + r_{11j}sector_{ij}year15_{tij} + u_{0ij} + \epsilon_{tij} \right] \end{aligned}\]

Fixed effects

## Random effects:
##  Groups         Name        Variance  Std.Dev. Corr  
##  qcaa_school_id (Intercept) 0.4666221 0.683097       
##                 year15      0.0069507 0.083371 -0.425
##  Residual                   0.1615822 0.401973
## 
##  Fixed effects:
##                             Estimate Std. Error    t value
## (Intercept)               2.84400284 0.21194908 13.4183310
## year15                    0.02927534 0.03094077  0.9461739
## sectorGovernment          0.03976737 0.23892386  0.1664437
## sectorIndependent        -0.44995296 0.28698959 -1.5678372
## unityear_12_enrolments   -0.06749144 0.02913496 -2.3165101
## year15:sectorGovernment   0.03977166 0.03469870  1.1462002
## year15:sectorIndependent -0.01081485 0.04262658 -0.2537113
## 
##  Number of schools (level-two group) = 81 
##  Number of district (level-three group) = NA

Based on the model output (see detailed explanation of fixed effects in step 9), the estimated mean enrolments for government schools are estimated to increase by 7.1487% (\((e^{0.0292753+0.0397717} - 1) \times 100\)) each year, which is 4.0573% (\((e^{0.0397717} - 1) \times 100)\) more than that of catholic schools.

On the other hand, independent schools are estimated to have a 1.8632% (\((e^{0.0292753 -0.0108148} - 1) \times 100\)) increase in enrolments per year, on average. This increase is 1.0757% (\((e^{-0.0108148}) - 1) \times 100\)) less than that of catholic schools.

Fixed effects of the final model for Marine Science subject

Figure 75: Fixed effects of the final model for Marine Science subject

The fixed effects are better accentuated in Figure 75. This figure shows that government schools are expected to have the largest increase in enrolments over the years, on average. For all sectors, it appears that unit 12 enrolments are consistently less than that of unit 11 enrolments, which may suggests that students are not pursuing year 12 after completing year 11 syllabus.

Random effects

Random effects for all schools

Figure 76: Random effects for all schools

Figure 76 represents the random intercept and slope for the random effects for a given school. It is manifest that there are no clear relationship between the random intercept and slope. Some schools with low enrolments when the subject was first introduced the school saw a large increase in enrolments over the years, while some showed a sharp decrease.

Predictions

Model predictions for 20 randomly selected schools

Figure 77: Model predictions for 20 randomly selected schools

Figure 77 above shows the predictions for 20 randomly selected schools.

Science in Practice

Exploring the dataset with basic linear model

Basic linear model for 20 randomly selected schools to provide an at-a-glance visualisation of enrolment trends within schools for Specialist Mathematics subject

Figure 78: Basic linear model for 20 randomly selected schools to provide an at-a-glance visualisation of enrolment trends within schools for Specialist Mathematics subject

With reference to the first step, Figure 78 fits a linear model for enrolments for a random sample of 20 schools to provide insights on the enrolments trends in Science in Practice. As demonstrated, cohort size for each school vary, with smaller cohorts in schools 517, 530, 655 (bottom) which has enrolment numbers of approximately less than 50 each year; and larger schools such as school 416, which has more than 200 enrolments in a single cohort by the end of 2021. Some schools showed a steep increase in enrolments (e.g. school 416 and 501), while some schools showed rather constant or decreasing enrolments trends (e.g. school 655).

Getting the data ready for modelling

Removing zero enrolments

All zero enrolments in a given year will be removed for modelling. As aforementioned, most of the zero enrolments in year 11 (refer to Figure 9) were attributed to the 2007 prep year cohort while zero enrolments in year 12 relates to the first year in which a school introduces the subject. Other zero enrolments mostly relates to smaller schools with little to no enrolments in the subject for a given year. These zero enrolments will be removed for modelling purposes.

Linearise response variable using log transformation

Effects of log transformation for response variable (enrolments) in Specialist Mathematics subject

Figure 79: Effects of log transformation for response variable (enrolments) in Specialist Mathematics subject

As multilevel model assumes normality in the error terms, a log transformation is utilised to allow models to be estimated by the linear mixed models. The log transformation allows enrolment numbers to be approximately normally distributed (Figure 79.

Unconditional means model

Table 48: AIC values for all candidate models for Science in Practice
df AIC
Model0.0: Within schools 3 4705.261
Model0.2: Schools nested within districts 4 4705.552
Model0.1: Schools nested within postcodes 4 4707.261

As outlined in step 3, the three candidate models are fitted and their AIC is shown in Table 48. Based on the AIC, the two-level model (model0.0) is the superior model and will be used in the subsequent analysis.

Intraclass correlation (\(ICC\))

## Random effects:
##  Groups         Name        Variance Std.Dev.
##  qcaa_school_id (Intercept) 0.62634  0.79142 
##  Residual                   0.47198  0.68701
## 
##  Fixed effects:
##             Estimate Std. Error  t value
## (Intercept)  2.75816 0.06012933 45.87046
## 
##  Number of schools (level-two group) = 196 
##  Number of district (level-three group) = NA

This model takes into account 196 schools. For a two-level multilevel model, the level two intraclass correlation coefficient (ICC) can be computed using the model output above. The level-two ICC is the correlation between a school \(i\) in time \(t\) and time \(t^*\):

\[ \text{level-two ICC} = \frac{\tau^{2}_{00}}{\tau^{2}_{00} + \phi^{2}_{00} + \sigma^2} = \frac{0.6263}{(0.6263 + 0.4720)} = 0.5702\]

This can be conceptualised as the correlation between the enrolments of a selected school at two randomly drawn year (i.e. two randomly selected cohort from the same school). In other words, 57.02% of the total variability is attributable to the differences in enrolments within schools at different time periods.

Unconditional Growth model

##  Groups         Name        Variance Std.Dev. Corr  
##  qcaa_school_id (Intercept) 1.108130 1.05268        
##                 year10      0.010253 0.10126  -0.720
##  Residual                   0.306377 0.55351
##              Estimate  Std. Error  t value
## (Intercept) 1.8442381 0.097944037 18.82951
## year10      0.1004847 0.009731646 10.32556
##  Number of Level Two groups =  196 
##  Number of Level Three groups =  NA

The next step involves incorporating the linear growth of time into the model. The model output is shown above.

  • \(\pi_{0ij}\) = 1.8442: Initial status for school \(i\) (i.e. expected log enrolments when time = 0)
  • \(\pi_{1ij}\) = 0.1005: Growth rate for school \(i\)
  • \(\epsilon_{tij}\) = 0.3064: Variance in within-school residuals after accounting for linear growth overtime

When the subject was first introduced in 2010, schools were expected to have an average of 6.3230 (\(e^{1.8442}\)) enrolments, which is a relatively low number as compared to the other mathematics and science subjects. On average, the enrolments were expected to increase by 10.5724% (\((e^{0.1005} - 1) \times 100\)) per year. The estimated within-school variance decreased by 16.70% (0.4720 to 0.30638), indicating the 35.089% can be explained by the linear growth in time.

Testing fixed effects

Table 49: AIC for all possible models with different combinations of fixed effects
model npar AIC BIC logLik
model4.1 14 4022.482 4101.041 -1997.241
model4.6 12 4024.466 4091.802 -2000.233
model4.9 13 4025.268 4098.215 -1999.634
model4.7 13 4025.268 4098.215 -1999.634
model4.8 13 4025.268 4098.215 -1999.634
model4.0 16 4025.315 4115.096 -1996.657
model4.5 12 4025.633 4092.969 -2000.816
model4.3 10 4027.180 4083.293 -2003.590
model4.2 11 4027.303 4089.028 -2002.651
model4.4 11 4027.477 4089.202 -2002.739
model4.10 11 4031.656 4093.381 -2004.828

As summarise in step 6, level-two predictors sector and unit will be added to the model. The largest possible model (model4.0) will first be fitted, before iteratively removing fixed effects one at a time (with model4.10 being the smallest of all 10 candidate models), whilst recording the AIC for each model. model4.1 appears to have the optimal (smallest) AIC (Table 49), and will be used in the next section in building the final model.

Parametric bootstrap to test random effects

Table 50: Parametric Bootstrap to compare larger and smaller, nested model
npar AIC BIC logLik deviance Chisq Df Pr_boot(>Chisq)
12 4227.105 4294.442 -2101.553 4203.105 NA NA NA
14 4022.482 4101.041 -1997.241 3994.482 208.6234 2 0
Histogram of likelihood ratio test statistic, with a red vertical line indicating the likelihood ratio test statistic for the actual model

Figure 80: Histogram of likelihood ratio test statistic, with a red vertical line indicating the likelihood ratio test statistic for the actual model

The parametric bootstrap is used to approximate the likelihood ratio test statistic to produce a more accurate p-value by simulating data under the null hypothesis (detailed explanation can be found in step 7. The p-value indicates the proportion of times in which the bootstrap test statistic is greater than the observed test statistic. Figure 80 displays the likelihood ratio test statistic from the null distribution, with the red line indicates the likelihood ratio test statistic using the actual data.

There is overwhelming statistical evidence (\(\chi^2\) = 208.623 and \(p\)-value = 0 from Table 50) that the larger model (including random slope at level two) is the better model.

Confidence interval

Table 51: 95% confidence intervals for fixed and random effects in the final model
var 2.5 % 97.5 %
sd_(Intercept)|qcaa_school_id 0.8735519 1.1692900
cor_year10.(Intercept)|qcaa_school_id -0.8548733 -0.7237709
sd_year10|qcaa_school_id 0.0831679 0.1154846
sigma 0.5316834 0.5674079
(Intercept) 1.3083256 2.5093735
year10 -0.0246597 0.0921339
sectorGovernment -0.4923102 0.7377683
sectorIndependent -1.5531034 0.0525608
unityear_12_enrolments -0.2032051 0.1515589
year10:sectorGovernment 0.0124112 0.1429585
year10:sectorIndependent -0.0102259 0.1449415
year10:unityear_12_enrolments 0.0026681 0.0279735
sectorGovernment:unityear_12_enrolments -0.3285803 0.0027383
sectorIndependent:unityear_12_enrolments -0.4816025 -0.0720295

The parametric bootstrap is utilised to construct confidence intervals (detailed explanation in step 8) for the random effects. If the confidence intervals between the random effects does not include 0, it provides statistical evidence that the p-value is less than 0.5. In other words, it suggests that the random effects and the correlation between the random effects are significant at the 5% level. The confidence interval for the random effects all exclude 0 (Table 51), indicating that they’re different from 0 in the population (i.e. statistically significant).

Interpreting final model

Composite model

  • Level one (measurement variable) \[Y_{tij} = \pi_{0ij} + \pi_{1ij}year10_{tij} + \epsilon_{tij}\]

  • Level two (schools within districts) \[\begin{aligned} \pi_{0ij} &= \beta_{00j} + \beta_{01j}sector_{ij} + \beta_{02j}unit_{ij} + \beta_{03j}sector_{ij}unit_{ij} + u_{0ij} \\ \pi_{1ij} &= \beta_{10j} + \beta_{11j}sector_{ij} + \beta_{12}unit_{ij} + u_{1ij} \end{aligned}\]

Therefore, the composite model can be written as

\[\begin{aligned} Y_{tij} =\ & \pi_{0ij} + \pi_{1ij}year10_{tij} + \epsilon_{tij} \\ =\ & (\beta_{00j} + \beta_{01j}sector_{ij} + \beta_{02j}unit_{ij} + \beta_{03j}sector_{ij}unit_{ij} + u_{0ij}) + \\ & (\beta_{10j} + \beta_{11j}sector_{ij} + \beta_{12}unit_{ij} + u_{1ij})year10_{tij} + \epsilon_{tij} \\ =\ & \left[\beta_{00j} + \beta_{01}sector_{ij} + \beta_{02}unit_{ij} + \beta_{03j}sector_{ij}unit_{ij} + \beta_{10j}year10_{tij} + \beta_{11j}sector_{ij}year_{tij} + \beta_{12}unit_{ij}year10_{tij} \right] \left[u_{0ij} + u_{u_{1ij}} + \epsilon_{tij} \right] \end{aligned}\]

Fixed Effects

summary(model)
##  Groups         Name        Variance  Std.Dev. Corr  
##  qcaa_school_id (Intercept) 1.0471207 1.023289       
##                 year10      0.0097253 0.098617 -0.802
##  Residual                   0.3036501 0.551045
##                                             Estimate  Std. Error   t value
## (Intercept)                               1.92746582 0.296641578  6.497625
## year10                                    0.03386364 0.030295845  1.117765
## sectorGovernment                          0.11738215 0.315770024  0.371733
## sectorIndependent                        -0.78920786 0.393905776 -2.003545
## unityear_12_enrolments                   -0.02478714 0.091733860 -0.270207
## year10:sectorGovernment                   0.07898180 0.032078517  2.462140
## year10:sectorIndependent                  0.06729049 0.038678772  1.739727
## year10:unityear_12_enrolments             0.01479030 0.006765064  2.186277
## sectorGovernment:unityear_12_enrolments  -0.15761038 0.082352025 -1.913862
## sectorIndependent:unityear_12_enrolments -0.27451131 0.102332610 -2.682540
##  Number of Level Two groups =  196 
##  Number of Level Three groups =  NA

Based on the model output, government schools are estimated to have a mean increase of 11.9459% (\((e^{0.0248299 + 0.0197221} - 1) * \times 100\)) per year, which is 8.21846% (\((e^{0.0789818} - 1) * \times 100\)) greater than the increase in enrolments in catholic schools. Independent schools showed a large increase in enrolments (10.6447%) per year, on average. This increase in enrolments is 6.96062% (\((exp^{-0.0221572} - 1) * 100\)) more than that of catholic schools.

Fixed effects of the final model for Science in Practice

Figure 81: Fixed effects of the final model for Science in Practice

The results can be better visualised in Figure 81. Catholic schools are estimated to have smallest growth (as shown by the gentle slope) relative to the other two sectors. By the same token, year 12 enrolments appears to be increasing at a faster rate than that of year 11 units, this may indicate that students may be opting to take the unit in year 10 and re-enrol in year 12.

Random effects

Random effects for all schools

Figure 82: Random effects for all schools

Figure 82 shows the random effects for all 196 schools that offered the subject. There is a clear negative correlation (-0.80) between the random intercept and the random slope, which indicates that in general, schools with lesser enrolments are generally matched with a larger increase in enrolments over the years.

Predictions

Model predictions for year 11 enrolments for 20 randomly selected schools

Figure 83: Model predictions for year 11 enrolments for 20 randomly selected schools

Figure 83 above shows the predictions for 20 randomly selected schools.

Conclusion

Findings

This report explores the various trends in enrolments in the old and new QCE system. In general, a large increase in enrolments can be seen in the new QCE system, which is attributable to the 2007 prep year cohort leaving the school system, allowing each year (from 2020 onwards) in the new QCE system to have a full cohort.

A multilevel model was built to predict enrolments for year 11 and year 12. Some models showed interesting findings, such as having a large negative correlation with random slope and intercept; hinting that larger schools are matched with a lower change in enrolments across the years. School districts appear to play a large role in predicting enrolments, and most models showed that there were stark differences in enrolments across the different districts.

Where do we go from here?

This report offers a broad overview of the enrolments in Queensland, Australia. An advantage of using multilevel model is that both intra-school change (i.e. how school enrolments are changing overtime) and interindividual change (i.e. Differences in temporal change across schools) may be recorded. Thus, the analysis will shift to creating a (shiny) web application to zoom in at the more granular (e.g. single school) statistics and information. The interactive features of the application will allow for a more personalised platform for the different needs of users, from extracting information about a single school to comparing enrolments across the different districts. It will include It also aims to incorporate spatial analysis such as the map of Queensland schools to further scrutinise the enrolment trends in the new QCE system.

Acknowledgement

The programming language used to analyse trends in enrolments for senior secondary (Year 11 and Year 12) mathematics and science subjects in Queensland is R (4.0.2) (R Core Team 2020), and the platform used for running R language is RStudio (1.4.1717) (RStudio Team 2020).

The following packages have been included in our Rmd file.

References

Art of Smart. 2021. Everything You Need to Know about the QCE and ATAR. Art of Smart. https://artofsmart.com.au/study/qld-atar/?nab=0&utm_referrer=https%3A%2F%2Fwww.google.com%2F.
Bates, Douglas, Martin Mächler, Ben Bolker, and Steve Walker. 2015. “Fitting Linear Mixed-Effects Models Using lme4.” Journal of Statistical Software 67 (1): 1–48. https://doi.org/10.18637/jss.v067.i01.
Bell, Andrew, and Kelvyn Jones. 2015. “Explaining Fixed Effects: Random Effects Modeling of Time-Series Cross-Sectional and Panel Data.” Political Science Research and Methods 3 (1): 133–53.
Ben Bolker et al. 2021. GLMM FAQ. https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#singular-models-random-effect-variances-estimated-as-zero-or-correlations-estimated-as---1.
Bolker, Ben, and David Robinson. 2021. Broom.mixed: Tidying Methods for Mixed Models. https://CRAN.R-project.org/package=broom.mixed.
Faraway, Julian J. 2016. Extending the Linear Model with r: Generalized Linear, Mixed Effects and Nonparametric Regression Models. CRC press.
(https://stats.stackexchange.com/users/1979/fabians), fabians. n.d. “How to Test Random Effects in a Multilevel Model in r.” Cross Validated. https://stats.stackexchange.com/q/4870.
Independent Schools Queensland (ISQ). 2020. Snapshot: 2020 Commonwealth Census. Queensland, Brisbane: Independent Schools Queensland (ISQ). https://www.isq.qld.edu.au/media/uxlh5zn3/qld-school-enrolments-snapshot-august-2020.pdf.
Mossman State High School. 2019. Senior Subject Selection (Year 11 2020 | Year 12 2021). Queensland: Mossman State High School. https://mossmanshs.eq.edu.au/SupportAndResources/FormsAndDocuments/Documents/Subject%20Information/2020-senior-subject-information.pdf.
Müller, Kirill. 2020. Here: A Simpler Way to Find Your Files. https://CRAN.R-project.org/package=here.
Pedersen, Thomas Lin. 2020. Patchwork: The Composer of Plots. https://CRAN.R-project.org/package=patchwork.
R Core Team. 2020. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Roback and Legler. 2021. Beyond Multiple Linear Regression: Applied Generalized Linear Models and Multilevel Models in r. https://bookdown.org/roback/bookdown-BeyondMLR/.
Robinson, David, Alex Hayes, and Simon Couch. 2021. Broom: Convert Statistical Objects into Tidy Tibbles. https://CRAN.R-project.org/package=broom.
RStudio Team. 2020. RStudio: Integrated Development Environment for r. Boston, MA: RStudio, PBC. http://www.rstudio.com/.
The State of Queensland (Queensland Curriculum & Assessment Authority). 2019a. Essential Mathematics General Senior Syllabus 2019. South Queensland, Brisbane: The State of Queensland (Queensland Curriculum & Assessment Authority). https://www.qcaa.qld.edu.au/downloads/senior-qce/syllabuses/snr_ess_maths_19_app_syll.pdf.
———. 2019b. Subjects: Participation by School in Senior Secondary. South Queensland, Brisbane: The State of Queensland (Queensland Curriculum & Assessment Authority). https://www.qcaa.qld.edu.au/downloads/publications/qcaa_stats_subj_part_1992-2019.xlsx.
Tierney, Nicholas, Di Cook, Miles McBain, and Colin Fay. 2020. Naniar: Data Structures, Summaries, and Visualisations for Missing Data. https://CRAN.R-project.org/package=naniar.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Xie, Yihui. 2021. Knitr: A General-Purpose Package for Dynamic Report Generation in r. https://yihui.org/knitr/.
Zeileis, Achim, Jason C. Fisher, Kurt Hornik, Ross Ihaka, Claire D. McWhite, Paul Murrell, Reto Stauffer, and Claus O. Wilke. 2020. colorspace: A Toolbox for Manipulating and Assessing Colors and Palettes.” Journal of Statistical Software 96 (1): 1–49. https://doi.org/10.18637/jss.v096.i01.
Zhu, Hao. 2021. kableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.